
Parallel and Distributed

Algorithms
Course for Undergraduate Students in the 3rd year
(Major in Computer Science)

Instructor:
Mihai L. Mocanu, Ph.D., Professor
E-mail: mihai.mocanu@edu.ucv.ro

Office hours (by appointment only): Friday 13:00-14:00


Course: 2024-2025_ParaDis_3CE (Google Workspace)
What this course is about
 Parallelism/ the ability to think in parallel:
 Classically, algorithm designers assume a computer with only one processing element
 The algorithms they design are “sequential”, that is, all steps of the algorithm must be performed in a particular sequence
 But today’s computers have multiple processors, multiple cores etc., each of which can perform its own chain of execution
 This holds even for basic desktop computers (commodity computers), which have multicore processors or pipelined processing elements
 Not to mention the vast clusters of computers, aggregated in networks, grids, clouds etc.
 It is worth mentioning that real-world processes may need algorithmic descriptions that are far from sequential


Concurrent Computing
 Concurrency means multiple computations are happening at
the same time
 Theoretically, computer programming has long been
interested in the concurrency concept
 Pioneering work on Petri Nets in the 1960s was among the
earliest ideas, but others followed (CSP, CCS)
 “While concurrent program execution had been considered
for years, the computer science of concurrency began with
Edsger Dijkstra’s seminal 1965 paper that introduced the
mutual exclusion problem” - Leslie Lamport (2015)
 The history of concurrency is long and there is an enormous
amount of development, on many perspectives
Parallel Computing
 Parallel computing is based on the idea of breaking down a
task into smaller parts that can be processed simultaneously
 The idea of concurrency is connected to parallelism; still, the
two must not be confused: concurrency is the simultaneous
operation of separate processes, while parallelism is the
construction and operation of several sub-processes of the
same process at the same time
 In the actual creation/ operation of a computer, architectural
intricacies and the hardware specifications of its parts often
manifest themselves in the resulting behavior and performance
 Also, the characteristics of parallel computing should be
incorporated in the development as well as assessment of
parallel algorithms
Distributed Computing
 Thanks to the emergence of large network interconnections, many of the resources required by programs no longer reside on a single system
 Initially, dispersed systems were built and, at some point, physical access to remote systems became required
 Distributed systems emerged as collections of autonomous entities that work together to solve an issue that cannot be solved separately
 High performance isn’t the main purpose of employing a distributed system; what is needed is optimal performance
 The user-perceived response time is critical
 Network delay and access to common resources can cause significant latencies in highly distributed systems, which should be reduced
Why study P&D Algorithms?
 Suppose we’re working with a commodity multiprocessor system
(a p-processor system – like almost all are today)
 Almost inevitably, we’ll run into some complex problem that we
want it to solve in less time
 Let T represent the amount of time that a single processor takes to
solve this problem; if we divide the work, can we hope to develop
an algorithm that takes T/p time to reach a solution?
 Why will this rarely be possible to achieve entirely, in reality?
 Depending not only on the problem but also on the solution
we adopted, the processors will almost always spend some
time coordinating their work, which will lead to a penalty
Example 1
 Sometimes a near-optimal speedup is easy to achieve
 Suppose that we have an array of integers, and we want to display
all the negative integers in the array
27 -113 45 11 84 -32 -55 18 -7 23

 We can divide the array into equal-sized segments, one
segment for each processor, and each processor can display
all the negative integers in its segment
 The sequential algorithm would take O(n) time
 In our multiprocessor algorithm each processor handles an
array segment with at most n/p elements
 So, processing the whole array takes O(n/p) time
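Purely as an illustration (not part of the original slides), a minimal C/OpenMP sketch of this idea could look as follows; the array holds the sample values above, and OpenMP’s static schedule plays the role of the per-processor segments.

#include <stdio.h>

int main(void) {
    int a[] = { 27, -113, 45, 11, 84, -32, -55, 18, -7, 23 };
    int n = sizeof a / sizeof a[0];

    /* Each thread receives a contiguous chunk of the array (static schedule),
       mirroring the "one segment per processor" idea; the chunks are scanned
       independently, so the elapsed time is O(n/p). */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)
            printf("%d\n", a[i]);   /* order of the output lines is not deterministic */
    }
    return 0;
}

Compile with, e.g., gcc -fopenmp; the same loop without the pragma is exactly the sequential O(n) algorithm.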
Example 2
 There are some problems where it is hard to figure out how to use
multiple processors to speed up the processing time
 Let’s consider the problem of arbitrary precision arithmetic*: we
are given 2 integer arrays representing very large numbers (where
a[0] contains the 1’s digit, a[1] contains the 10’s digit, a.s.o.) and
we want to construct a new array containing the sum of numbers
 We know a sequential algorithm to do this that takes O(n) time.
 We can’t easily adapt this algorithm to use many processors,
though: In processing each segment of the arrays, we need to know
whether there is a carry from the preceding segments.
 Thus, this problem, like many others, seems inherently sequential.
 Still, there is an algorithm that can add two such big-integers in
O(n/p + log p) time (Q: slowdown with log p?)
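Such an algorithm relies on prefix-computation ideas that appear later in the topics list. Purely as an illustrative sketch of the underlying idea (each segment precomputes its carry behaviour, then carries are resolved between segments), here is a hedged C/OpenMP version under simplifying assumptions: decimal digits, n divisible by the number of segments, and the inter-segment carry pass done sequentially in O(p) rather than with a log p prefix scan.

#include <stdio.h>

#define N    16   /* total number of digits; assumed divisible by SEGS */
#define SEGS 4    /* number of segments ("processors") */

int main(void) {
    int a[N], b[N], sum[N + 1];
    int carry_out[SEGS][2];   /* carry out of each segment, for carry-in 0 and for 1 */
    int carry_in[SEGS + 1];   /* actual carry into each segment */

    /* a[0] holds the 1's digit, a[1] the 10's digit, a.s.o. (arbitrary test data) */
    for (int i = 0; i < N; i++) { a[i] = (7 * i + 3) % 10; b[i] = (5 * i + 9) % 10; }

    int len = N / SEGS;

    /* Phase 1 (parallel): each segment computes its carry-out twice,
       once assuming an incoming carry of 0 and once assuming 1. */
    #pragma omp parallel for
    for (int s = 0; s < SEGS; s++)
        for (int cin = 0; cin <= 1; cin++) {
            int c = cin;
            for (int i = s * len; i < (s + 1) * len; i++)
                c = (a[i] + b[i] + c) / 10;
            carry_out[s][cin] = c;
        }

    /* Phase 2 (sequential, O(p); could be replaced by a parallel prefix
       scan to get O(log p)): fix the actual carry into each segment. */
    carry_in[0] = 0;
    for (int s = 0; s < SEGS; s++)
        carry_in[s + 1] = carry_out[s][carry_in[s]];

    /* Phase 3 (parallel): each segment writes its digits of the sum,
       now that its incoming carry is known. */
    #pragma omp parallel for
    for (int s = 0; s < SEGS; s++) {
        int c = carry_in[s];
        for (int i = s * len; i < (s + 1) * len; i++) {
            int d = a[i] + b[i] + c;
            sum[i] = d % 10;
            c = d / 10;
        }
    }
    sum[N] = carry_in[SEGS];   /* final carry becomes the most significant digit */

    for (int i = N; i >= 0; i--) printf("%d", sum[i]);   /* most significant first */
    printf("\n");
    return 0;
}

Each segment does O(n/p) work in phases 1 and 3; only the short carry-resolution phase couples the segments, which is where the log p term of the better algorithm comes from.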
PDA Field of Study inside the CS Area
[Diagram: Parallel & Distributed Algorithms placed within the Computer Science area, alongside Algorithms & Data Structures, Architecture, Database and Information Retrieval, Operating Systems, Software Methodology & Software Engineering, Programming Languages, Human-Computer Interaction, Numerical and Symbolic Computation, Artificial Intelligence and Robotics, and the Social and Professional Context, following the ACM & IEEE curriculum recommendations]
Course objectives
To provide you with an introduction and overview to the
computational aspects of parallel and distributed computing
 To introduce several important parallel computing models
 To study the typical models for distributed computing.
To make basic concepts of P & D computing both accessible and
ready for immediate use
 Broadly understand P&D architectures
 Become familiar with typical software/ programming approaches
 Be able to quickly adapt to any programming environment
 Study the main classes of P&D algorithms
 Learn the jargon … understanding what people are talking about
 Be able to apply this knowledge
Course objectives (cont.)
More specifically, with the aim of drastically flattening the learning
curve in a parallel and/or distributed environment:
 Study the development methods of efficient P&D algorithms using:
 Data partitioning and functional partitioning
 Parallelism exploitation for different process topologies
 Efficient overcoming of communication latencies,
 Load balancing
 Termination detection
 Failure tolerance etc.
 Learn the conceptual models for the specification and verification
of P&D algorithms
 Understand the complexity models of P&D algorithms
Learning Outcomes
On successfully completing this course you should understand a
number of different models of P&D computing and the basic
techniques for designing algorithms in these models.
 Development techniques for parallel algorithms - examples from
the main classes: prefix algorithms, search, sort, selection, matrix
processing, graph algorithms, lists, trees
 Main development techniques of classes of distributed algorithms:
wave algorithms, mutual exclusion, topology establishment,
termination, fault tolerance, leader election, genetic algorithms
Textbooks and other references
Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis - Introduction to Parallel
Computing, Pearson Ed. 2003 (2nd Edition) or Benjamin/Cummings 1994 (1st Edition)
http://srmcse.weebly.com/uploads/8/9/0/9/8909020/introduction_to_parallel_computing_second_editi
on-ananth_grama..pdf , or https://docs.google.com/...
Behrooz Parhami - Introduction to Parallel Processing: Algorithms and Architectures,
Kluwer Academic Publ, 2002
Dan Grigoras – Parallel Computing. From Systems to Applications, Computer Libris
Agora, 2000, ISBN 973-97534-6-9
Mihai Mocanu – Algorithms and Languages for Parallel Processing, Publ. University of
Craiova, updated yearly for 20+ years on my home page
George Coulouris, Jean Dollimore, Tim Kindberg and Gordon Blair, Distributed Systems
Concepts and Design (5th ed.), Addison Wesley, 2011
https://repository.dinus.ac.id/docs/ajar/George-Coulouris-Distributed-Systems-Concepts-and-Design-5th-Edition.pdf

P.Pacheco, M.Malensek, An Introduction to Parallel Programming, MK, 2022


https://books.google.ro/books?id=rElkCwAAQBAJ&pg=PA460&lpg=PA460&dq=Peter+Pacheco,+M
alensek,+Matthew+-+An+Introduction+to+Parallel+Programming++pdf&source=...
… and many others
Working Resources
Laboratory and Projects:
1. Mihai Mocanu, Alexandru Patriciu – Parallel Computing in C for Unix and
Windows NT Networks, Publ. University of Craiova, 1998
2. Christofer H.Nevison et al. - Laboratories for Parallel Computing, Jones and
Bartlett, 1994
3. SPIN multi-threaded software verification tool, http://spinroot.com/spin/whatispin.html

Other resources are on the web page


Topics

1. Parallel Programming Platforms & Parallel Models


 logical and physical organization
 interconnection networks for parallel machines
 communication costs in parallel machines
 graph models: process - processor mappings, graph embeddings
 PRAMs. Complexity measures

Why?
It is better to be aware of the physical and economic constraints and tradeoffs
of the parallel system you are designing for, so as not to be sorry later.
Topics (continued)

2. Quick Introduction to Virtual Computing Environments


 heterogeneous platforms: understand and set up working environment
 semantics and syntax of basic communication operations
 PVM (Parallel Virtual Machine) vs MPI (Message Passing Interface)
 compiling and running PVM or MPI programs
 other parallel execution models/ programming interfaces
 pthreads (low-level) vs. OpenMP (high-level)
 CPU/ GPU/ DSP programming: CUDA vs. OpenCL
etc.
Why?
You can start writing simple parallel programs early on.
Topics (cont.)

3. Principles of Parallel Algorithm Design


 basic PRAM techniques: doubling technique
 summation trees and prefix summation
 decomposition techniques
 load balancing
 techniques for reducing communication overhead
 parallel algorithm models

Why?
These are fundamental issues that appear/apply to every parallel program.
You really should learn this stuff by heart.
Topics (cont.)

4. Implementation & Cost of Basic Communication Operations


 broadcast, reduction, scatter, gather, parallel prefix, …
 interconnection networks: graph models of networks
 network properties

Why?
These are fundamental primitives you will often use and should know well: not only
what they do, but also how much they cost and when and how to use them.
Going through the details of implementation allows us to see how the principles
from the previous topic are applied to relatively simple problems.
Topics (cont.)

5. Analytical Modeling of Parallel Programs


 sources of overhead
 execution time, speedup, efficiency, cost, Amdahl's law

Why?
Parallel programming is done to increase performance.
Debugging and profiling are extremely difficult in a parallel setting, so it is better
to understand from the beginning what performance to expect from a given
parallel program and, more generally, how to design parallel programs with
low execution time. It is also important to know the limits of what can and
cannot be done.
Topics (cont.)
6. Parallel Dense Matrix Algorithms
 matrix vector multiplication
 matrix matrix multiplication
 solving systems of linear equations
7. Parallel Graph Algorithms
 minimum spanning tree
 single-source shortest paths
 all-pairs shortest paths
 connected components
 algorithms for sparse graphs
Why?
Classical problems with lots of applications, many interesting and useful
techniques exposed.
Topics (cont.)
8. Parallel Sorting
 odd-even transpositions sort
 sorting networks, bitonic sort
 parallel quicksort, bucket and sample sort
9. Search Algorithms for Discrete Optimization Problems
 search overhead factor, speedup anomalies
 parallel depth-first search
 sorting and searching on PRAMs
10. Geometric Algorithms
 convex hulls, closest pair of points
Why?
As before; in addition, these show many examples of hard-to-parallelize problems.
Topics (cont.)
11. Concepts of distributed computation
 termination detection
 failure tolerance
 distributed network topologies
12. Formal Models
 state transitions and executions
 asynchronous and synchronous models
 causality constraints and the computation theorem
 logical clocks (Lamport) and vector clocks (Mattern-Fidge)
Why?
It is important and useful to understand both their societal importance (for the Internet,
the WWW, small devices such as mobiles and sensors) and their technical importance,
which aims to improve scalability and reliability through inherent distribution
Topics (cont.)
13. Distributed abstractions
 event-based component model
 specification of services
 safety and liveness
 node and channel failure models and links
 timing assumptions
14. Failure detectors
 classes of detectors
 leader elections and reductions
 quorums; byzantine quorums and leader election
Why?
Making use of abstractions, like timing assumptions and ways to encapsulate
them (failure detectors), proves to be very useful.
Topics (cont.)
15. Broadcast abstractions
 (lazy/ eager/ uniform/ fail-silent) reliable broadcast
 causal broadcast, vector-clock algorithms
 performance of algorithms
16. Consistency
 replicated shared memory
 linearizable registers
 linearizable single write multiple reader
 multiple writers
17. Consensus
 control-oriented vs. event-based, uniform consensus
 Paxos and multi-Paxos consensus
Topics (cont.)
18. Reconfigurable Replicated State Machines
 total order broadcasting
 robust distributed networks
19. Random Walks
 introduction to Markov processes
 random walks (hitting time, cover time, (s,t)-connectivity)
20. Time and Clocks in Distributed Systems
 time lease
 clock drift and the complete algorithm

Grading (tentative)
 20% practical laboratory assignments/ projects (L)
 20% periodic evaluation through homeworks/exercises (H)
 20% final test quiz (T)
 40% final exam (E)
You have to get at least 50% on the final laboratory evaluation (L) in
order to be allowed to sit the final exam in the next session.
You have to get at least 50% on the final exam (E) to pass and to
obtain a mark greater than 5. All the grades obtained enter, with the
specified weights, into the computation of the final mark.
If you have problems with setting up your working environment and/or
running your programs, ask the TA for help/advice; that is what they are
there for. Do not, however, expect them to fix routine programming bugs.
Assignments and evaluations

 assignments from individual homeworks (may be a project) for a total of 20 points
 mostly programming in C, C++ with:
• threads (POSIX Threads and OpenMP)
• multiple processes, PVM, MPI, etc., implementing (relatively) simple algorithms and load balancing techniques
• task-based and data-based parallelism (OpenCL)
• CUDA
 deep understanding of algorithms and performance models
 continuous evaluation based on theoretical questions thrown in to prepare you better for the final exam
Introduction
Parallel vs. Distributed Computing
 Background. “Traditional” definitions. Code examples
 Speedup. Amdahl’s law. Performance Measures
 Parallel Platforms for High Performance Computing
 The context and difficulties of parallel computing
 HPC/ parallel computing evolution over 50 years
 Grand challenge problems in demand for HPC (higher computational speed)
 A few core problems in Distributed Computing
Background
 Parallel Computer: a computer with many processors that
are closely connected
 frequently, all processors share the same memory
 they also communicate by accessing this shared memory
 Examples of parallel computers include the multicore
processors found in many computers today (even cheap
ones), as well as many graphics processing units (GPUs)
 Parallel Computing: “using more than one computer, or a
computer with more than one processor, to solve a task ”
 The parallel programming paradigm is not new: it has been
around for more than 50 years! Motives: faster computation,
larger amount of memory available, etc.
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The author
cannot believe that there will be any insuperable difficulty in
extending it to computers. It is not to be expected that the
necessary programming techniques will be worked out
overnight. Much experimenting remains to be done. After all,
the techniques that are commonly used in programming today
were only won at the cost of considerable toil several years
ago. In fact the advent of parallel programming may do
something to revive the pioneering spirit in programming
which seems at the present to be degenerating into a rather
dull and routine occupation ...”

Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April 1958, pp. 2-10.
Background
 Distributed Computer: one in which the processors are less
strongly connected –“a set of nodes connected by a network,
which appear to its users as a single coherent system”
 A typical distributed system consists of many independent
computers, attached via network connections
 Such an arrangement is called a cluster
 In a distributed system, each processor has its own memory.
This precludes using shared memory for communicating
 Processors instead communicate by sending messages
 Although message passing is slower than shared memory, it
scales better for many processors, and it is cheaper
PARALLEL vs. Distributed Computing

A working definition may differentiate them by:


• Focus (relative): coarse, medium or fine grain
• Main goal: shorter running time!
• The processors are contributing to a more efficient execution of a
solution to the same problem
• In parallel computing a problem involves lots of computations and
data (e.g. matrix multiplication, sorting), and, to attain efficiency,
communication must be kept to a minimum (optimal)
• In distributed systems the problems are different: often coordination
of resources is the most important (e.g. leader election, commit,
termination detection…)
Code example 1
 For a shared-memory parallel computer, here is code intended to
find the sum of all the elements in a long array.
 Variables whose name begin with my_ are specific to each
processor; this might be implemented by storing these variables in
individual processors’ registers
 The code fragment assumes that a variable array has already been
set up with the numbers we want to add; there is a variable procs
that indicates how many processors our system has
 In addition, we assume each processor has its own my_pid variable,
which stores that processor’s own processor ID, a unique number
between 0 and procs – 1
Parallel program
// 1. Determine where this processor’s segment is and add up the numbers in it
count = array.length / procs;
my_start = my_pid * count;
my_total = array[my_start];
for (my_i = 1; my_i < count; my_i++)
    my_total += array[my_start + my_i];

// 2. Store the subtotal into the shared array, then add up the subtotals.
subtotals[my_pid] = my_total;              // denoted as line A in remarks below
my_total = subtotals[0];                   // line B in remarks below
for (my_i = 1; my_i < procs; my_i++) {
    my_total += subtotals[my_i];           // line C in remarks below
}

// 3. If array.length isn’t a multiple of procs, the total will exclude some
//    elements at the end of the array. Add these last elements in now.
for (my_i = procs * count; my_i < array.length; my_i++)
    my_total += array[my_i];
Analysis: computation
 Here, we first divide the array into segments of length count, and
each processor adds up the elements within its segment, placing
that into its variable my_total.
 We write this variable into shared memory in line A so that all
processors can read it; then we go through this shared array of
subtotals to find the total of the subtotals.
 The last step is to take care of any numbers that may have been
excluded when we tried to divide the array into p equally-sized
segments.
Analysis: synchronization
 Each processor must complete line A before any other processor
tries to use that saved value in line B or line C (discuss!)
 One way of ensuring this is to build the computer so that all
processors share the same program counter as they step through
identical programs. Such an approach:
 Allows all processors to execute line A simultaneously.
 Though it works, it is quite rare because it can be difficult to
write a program so that all processors work identically
 We often want different processors to perform different tasks
 The more common approach is to allow each processor to execute
at its own pace, giving programmers the responsibility to include
code enforcing dependencies between processors’ work *
*
In our example above, we would add code between line A
and line B to enforce the restriction that all processors
complete line A before any proceed to line B and line C.
If we were using Java’s built-in features for supporting
such synchronization between threads, we could accomplish
this by introducing a new shared variable number_saved
whose value starts out at 0. The code following line A could
be as follows.
synchronized (subtotals) {
    number_saved++;
    if (number_saved == procs)
        subtotals.notifyAll();
    while (number_saved < procs) {
        try { subtotals.wait(); }
        catch (InterruptedException e) { }
    }
}
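Since the laboratory work is mostly in C (see the assignments slide), the same barrier idea can be expressed with POSIX threads. The following is only an illustrative sketch of the whole shared-memory example, not the course’s reference code (compile with -pthread; error checking omitted, array length assumed divisible by PROCS).

#include <pthread.h>
#include <stdio.h>

#define PROCS 4
#define N     1000

static int array[N];
static long subtotals[PROCS];
static pthread_barrier_t line_A_done;     /* replaces the Java synchronized block */

static void *worker(void *arg) {
    int my_pid = (int)(long)arg;
    int count = N / PROCS;
    int my_start = my_pid * count;

    long my_total = 0;
    for (int i = 0; i < count; i++)        /* local segment sum */
        my_total += array[my_start + i];

    subtotals[my_pid] = my_total;          /* "line A" */
    pthread_barrier_wait(&line_A_done);    /* all line A's are complete past this point */

    my_total = subtotals[0];               /* "line B" */
    for (int p = 1; p < PROCS; p++)
        my_total += subtotals[p];          /* "line C" */

    if (my_pid == 0)
        printf("total = %ld\n", my_total);
    return NULL;
}

int main(void) {
    pthread_t t[PROCS];
    for (int i = 0; i < N; i++) array[i] = i;
    pthread_barrier_init(&line_A_done, NULL, PROCS);
    for (long p = 0; p < PROCS; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < PROCS; p++) pthread_join(t[p], NULL);
    pthread_barrier_destroy(&line_A_done);
    return 0;
}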
Code example 2
 From now on, we’ll be working with a message-passing system
implemented using the following two functions
void send(int dst_pid, int data) – sends a message containing the
integer data to the processor whose ID is dst_pid. Note that the
function’s return may be delayed until the receiving processor
requests to receive the data — though the message might instead
be buffered so that the function can return immediately
int receive(int src_pid) – waits until the processor whose ID is
src_pid sends a message and returns the integer in that message
(blocking receive). Some systems also support a non-blocking
receive, which returns immediately if the processor hasn’t yet
sent a message. Another variation is a receive that allows a
program to receive the first message sent by any processor*
Distributed code
 On the same example, we assume that each processor already has
its segment of the array in its memory, called segment
 The variable procs holds the number of processors in the system,
and pid holds the processor’s ID (a unique integer between 0 and
procs – 1, as before)
total = segment[0];
for (i = 1; i < segment.length; i++)
    total += segment[i];

if (pid > 0) {                 // each processor but 0 sends its total to processor 0
    send(0, total);
} else {                       // processor 0 adds all these totals up
    for (int k = 1; k < procs; k++)
        total += receive(k);
}
Analysis
 This code says that each processor should first add the elements
of its segment. Then each processor except processor 0 should
send its total to processor 0
 Processor 0 waits to receive each of these messages in succession,
adding the total of that processor’s segment into its total. By the
end, processor 0 will have the total of all segments
 In a large distributed system, this approach may be flawed since
inevitably there is a possibility that some processors would break,
often due to the failure of some equipment such as a hard disk or
power supply
 We’ll ignore this issue here, but it is an important issue when
writing programs for large distributed systems in real life
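For comparison, and since MPI is used later in the course, the same pattern as in the distributed code above written against the actual MPI point-to-point API might look like the sketch below (in practice a single MPI_Reduce call would do the same job); the segment contents here are made up so that the example is self-contained.

/* Illustrative MPI sketch of the distributed sum (compile with mpicc, run with mpirun);
   each rank is assumed to already hold its own segment. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int pid, procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    /* Stand-in for the local segment: every rank just fabricates a few values. */
    int segment[4] = { pid, pid + 1, pid + 2, pid + 3 };
    int total = 0;
    for (int i = 0; i < 4; i++) total += segment[i];

    if (pid > 0) {                       /* every rank but 0 sends its subtotal */
        MPI_Send(&total, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {                             /* rank 0 receives and accumulates */
        for (int k = 1; k < procs; k++) {
            int sub;
            MPI_Recv(&sub, 1, MPI_INT, k, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += sub;
        }
        printf("grand total = %d\n", total);
    }

    MPI_Finalize();
    return 0;
}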
Performance view
Parallel System: An optimized collection of processors, dedicated
to the execution of complex tasks; each processor executes in a
semi-independent manner a subtask and co-ordination may be
needed from time to time. The primary goal of parallel processing
is a significant increase in performance.

Distributed System: A collection of multiple autonomous computers, communicating through a computer network, that interact with each other in order to achieve a common goal.
Leslie Lamport: “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable”
Remark. Parallel processing in distributed environments is not only possible but also a cost-effective and attractive alternative
Speedup Factor
S(p) = (execution time using one processor, i.e. the best sequential algorithm) / (execution time using a multiprocessor with p processors) = ts / tp
Speedup factor can also be cast in terms of computational steps:
S(p) = (number of computational steps using one processor) / (number of parallel computational steps with p processors)
 S(p) gives the increase in speed obtained by using “a multiprocessor”
Hints:
 Use the best sequential algorithm with the single-processor system
 The underlying algorithm for the parallel implementation might be (and usually is) different
Maximum Speedup
Is usually p with p processors (linear speedup)
If a fraction f of the computation is inherently serial (cannot be parallelized), the speedup factor is given by:
S(p) = ts / (f·ts + (1 − f)·ts/p) = p / (1 + (p − 1)·f)
This equation is known as Amdahl’s law
Remark: Possible but unusual to get superlinear speedup (greater than p), due to a specific reason such as:
 Extra memory in the multiprocessor system
 Nondeterministic algorithm
Maximum Speedup
Amdahl’s law
[Figure: (a) with one processor the execution time is ts, made up of a serial section f·ts and parallelizable sections (1 − f)·ts; (b) with p processors the parallelizable part shrinks to (1 − f)·ts/p, so tp = f·ts + (1 − f)·ts/p]
Speedup against number of processors
• Even with an infinite number of processors, maximum speedup is limited to 1/f
• Ex: With only 5% of the computation being serial, maximum speedup is 20
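As a worked check of this limit (using only the formula above, nothing new):

S(p) = \frac{t_s}{f\,t_s + (1-f)\,t_s/p} = \frac{p}{1+(p-1)f} \;\longrightarrow\; \frac{1}{f} \text{ as } p \to \infty,
\qquad f = 0.05 \;\Rightarrow\; S_{\max} = \frac{1}{0.05} = 20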
*
 This is a perfect example of how NOT to multithread!
 You are creating a new task for each iteration of a
recursive function
 So, each task creates a new task, waits for that task to
finish and then adds the numbers from the result
 Each thread has two jobs :
1 - to create a new thread
2 - to add two numbers.
 The overhead cost for creating each thread is going to
far outweigh the cost of adding two numbers together
 Solutions:
1. Limit the number of threads
2. Use the iterative algorithm to start from, instead
of the recursive one
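The recursive example this slide comments on is not reproduced in these notes; a minimal sketch of the anti-pattern it describes, one new POSIX thread per recursive call of an array sum (illustrative only, and deliberately wasteful; compile with -pthread), could look like this:

/* Anti-pattern sketch: summing array[lo..hi) by spawning a thread for every
   recursive call (here in divide-and-conquer form). It computes the right
   answer, but thread creation costs far more than the additions it "parallelizes". */
#include <pthread.h>
#include <stdio.h>

static int array[1 << 10];

struct range { int lo, hi; long sum; };

static void *sum_range(void *arg) {
    struct range *r = arg;
    if (r->hi - r->lo == 1) {                    /* base case: one element */
        r->sum = array[r->lo];
        return NULL;
    }
    int mid = (r->lo + r->hi) / 2;
    struct range left  = { r->lo, mid, 0 };
    struct range right = { mid, r->hi, 0 };
    pthread_t t;
    pthread_create(&t, NULL, sum_range, &left);  /* a new thread per call! */
    sum_range(&right);                           /* recurse in this thread */
    pthread_join(t, NULL);                       /* wait, then combine the partial sums */
    r->sum = left.sum + right.sum;
    return NULL;
}

int main(void) {
    for (int i = 0; i < (1 << 10); i++) array[i] = 1;
    struct range all = { 0, 1 << 10, 0 };
    sum_range(&all);
    printf("sum = %ld\n", all.sum);              /* prints 1024 */
    return 0;
}

As the slide says, the fix is either to stop spawning below a size threshold (capping the number of threads) or to let each thread sum its chunk iteratively, as in the earlier shared-memory example.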
Superlinear Speedup - Searching
[Figure: (a) searching each sub-space sequentially – the sequential search spends ts/p per sub-space and finds the solution after x·ts/p + Δt, i.e. Δt into the next sub-space after x complete ones (x indeterminate); (b) searching each sub-space in parallel – the solution is found after Δt]
Speedup is given by:
S(p) = (x · ts/p + Δt) / Δt
Worst case for sequential search when the solution is found in the last sub-space searched. Then the parallel version offers the greatest benefit, i.e.
S(p) = (((p − 1)/p) · ts + Δt) / Δt → ∞ as Δt tends to zero
Least advantage for the parallel version when the solution is found in the first sub-space of the sequential search, i.e.
S(p) = Δt / Δt = 1
Measures of Performance
Speedup is obviously a MoP, but what is really relevant from the performance point of view?
• To computer scientists: speedup, execution time
• To applications people: size of problem, accuracy of solution etc.
Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set).
Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors.
 A more honest measure of performance.
 Avoids picking an easily parallelizable algorithm with poor sequential execution time.


The Context of Parallel Processing
Facts:
• The explosive growth of digital computer architectures
• Therefore, the need for:
• a better understanding of concurrency
• user-friendliness, compactness and simplicity of code
• high performance computing (HPC) but low cost,
low power consumption a.o.
• HPC uni-processors are increasingly complex & expensive
• they have high power-consumption
• they may also be under-utilized - mainly due to the
lack of appropriate software.
Increasing performance of microprocessors
 From 1986 to 2003, the performance of microprocessors
increased, on average, more than 50% per year [Henn, 2019*]
 Since 2003, performance improvement has slowed to the point
that from 2015 to 2017, it increased at less than 4% per year
 The first, unprecedented rate of increase meant that software
developers could simply wait for the next generation of
microprocessors to increase the performance of their applications;
can they still do so?
 This difference between the two is dramatic: at 50% per year,
performance will increase by almost a factor of 60 in 10 years,
while at 4%, it will increase by about a factor of 1.5

* John Hennessy, David Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 6th ed. (2019)
Possible trade-offs to achieve efficiency
What’s better?
 The use of one or a small number of such complex processors, OR
 A moderate to very large number of simpler processors
 The answer may seem simple, but there is a catch, forcing us to answer another question first:
 How “good” is communication between processors?
So:
 When combined with a high-bandwidth, but logically simple, inter-processor communication facility, the latter approach may lead to a significant increase in efficiency, not only in the execution but also in earlier stages (i.e. in the design process)
The Difficulties of Parallel Processing
Two major problems have prevented, over the years, the immediate and widespread adoption of such (moderately to) massively parallel architectures:
 the inter-processor communication bottleneck
 the difficulty of algorithmic/ software development – this may come at a very high cost
How were these problems overcome?
 At very high clock rates, the link between the processor and memory becomes very critical
- integrated processor/memory design optimization
- emergence of multiple-processor microchips
 The emergence of standard programming and communication models has removed some of the concerns with compatibility and software design issues in parallel processing
Increasing performance of applications
 Continuous demand for greater computational speed from a computer system made things possible
 The number of problems that we can seriously consider solving also increases
 Remember: Computations must not only be completed, but completed within a “reasonable” time period
 Areas requiring great computational speed include numerical modeling and simulation, scientific and engineering problems, accurate medical imaging, Web search, gaming, and many others
Grand Challenge Problems
Typically, they cannot be solved in a reasonable amount
of time with today’s computers. Obviously, an
execution time of 2 months is always unreasonable
Examples:
 Global weather forecasting/ Climate modeling

 Data analysis

 Modeling large DNA structures/ Protein folding

 Modeling motion of astronomical bodies

 Pharmaceutical research for drugs and vaccines

 Research for efficient clean energy sources, etc.


Global Weather Forecasting
 Atmosphere modeled by dividing it into 3-dim. cells
 Computations in each cell repeat many times to model time passing
 Suppose the whole global atmosphere is divided into cubic cells of size 1 km, to a height of 15 km - about 7.5 x 10^9 cells
(Earth’s surface area = 510.1 million km^2 = 196.9 million sq. miles)
 Suppose each calculation requires 200 floating point operations. In one time step, about 1.5 x 10^12 floating point operations are necessary
 To forecast the weather over 7 days using 10-minute intervals (about 10^3 time steps), a computer operating at 1 Gflops (10^9 flops) takes about 1.5 x 10^6 s ≈ 17 days. To perform the calculation in under 30 min. needs a computer operating at about 1 Tflops (10^12 flops)
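Written out, the arithmetic behind these figures (all inputs as assumed above) is:

\text{cells} \approx 510.1\times 10^{6}\,\text{km}^2 \times 15\,\text{km} \approx 7.5\times 10^{9},
\qquad \text{ops/step} = 200 \times 7.5\times 10^{9} = 1.5\times 10^{12}

\text{steps} = \frac{7\cdot 24\cdot 60}{10} \approx 10^{3},
\qquad \text{total} \approx 1.5\times 10^{15}\ \text{flops}

\frac{1.5\times 10^{15}}{10^{9}\,\text{flop/s}} = 1.5\times 10^{6}\,\text{s} \approx 17\ \text{days},
\qquad \frac{1.5\times 10^{15}}{1800\,\text{s}} \approx 8\times 10^{11}\,\text{flop/s} \approx 1\ \text{Tflops}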
Data Analysis
 We generate and store every year a huge amount of data; by some estimates, it doubles worldwide every 2 years
 Most data is largely useless unless it’s analyzed
Examples:
 Knowing the sequence of nucleotides in DNA is, by itself, of little use. Understanding how this sequence affects our development and how it can cause disease requires extensive analysis
 In addition to genomics, huge quantities of data are generated by particle colliders, such as the Large Hadron Collider at CERN, astronomical research, and so on
Modeling Motion of Astronomical Bodies
 Bodies are attracted to each other by gravitational forces
 Movement of each body is predicted by calculating the total force on each body
 With N bodies, N − 1 forces to calculate for each body, or approx. N^2 calculations (N log2 N for an efficient approximate algorithm); after determining the new positions of the bodies, the calculations are repeated
 A galaxy might have, say, 10^11 stars; even if each calculation is done in 1 ms (an extremely optimistic figure), it takes 10^9 years for one iteration using the N^2 algorithm and almost a year for one iteration using an efficient N log2 N approximate algorithm.
Astrophysical N-body simulation – screen snapshot
Pharmaceutical Research
 There are many ways in which increased computational
power can be used in research into new drugs, vaccines
or medical treatments.
 E.g., fighting COVID-19 required extensive research in
areas like bioinformatics, epidemiology, and molecular
modeling to understand the threat we’re facing and to
develop strategies to address it, but also…
 …bringing together the Federal government, industry,
and academic leaders to provide access to the world’s
most powerful HPC resources in support of research
 See https://covid19-hpc-consortium.org/ for details
Wide Range of Applications that really
depend now on HPC (Dongarra, 2017*)
• Airplane wing design
• Quantum chemistry
• Geophysical flows
• Noise reduction
• Diffusion of solid bodies in a liquid
• Computational materials research
• Weather forecasting
• Deep learning in neural networks
• Stochastic simulation
• Massively parallel data mining

Do we need powerful HPC?
Yes, to solve much bigger problems much faster!
“Fine-grain” and “coarse-grain” parallelism are different, i.e. the latter is mainly applicable to long-running, scientific programs
Performance
- there are problems which can use any amount of
computing (i.e. simulation)
Capability
- to solve previously unsolvable problems (such as prime
number factorization): data sizes too large, real-time
constraints
Capacity
- to handle a lot of processing much faster, perform more
precise computer simulations (e.g. weather prediction)
Why are HPC computers parallel?
From Transistors to FLOPS
 By Moore’s law the number of transistors per unit area doubles every 18
months; so, how do we make better use of these transistors?
 more execution units, graphical pipelines, processors etc.
But technology is not the only key, computer structure (architecture)
and organization are also important! (see next slide)
 There are also inhibitors of parallelism, such as dependencies
The Data Communication Argument
 Move computation towards data: for huge data, it is cheaper
The Memory/Disk Speed Argument
 Parallel computers yield better memory system performance,
because of larger aggregate caches, higher aggregate bandwidth
How did parallel computers evolve?

Execution Speed
 With a >10^2-fold increase of (floating point) execution speed every <10 years (more than the ~2^6 increase suggested by Moore’s law)
Communication Technology
 A factor which is critical to the performance of parallel computing platforms
 1985 – 1990: in spite of an average 20x increase in processor performance, the communication speed remained constant
Parallel Computing – How execution time decreased
[Chart: time (s) per floating-point instruction, decreasing over the years]
Motto: “I think there is a world market for maybe five computers” (Thomas Watson, IBM Chairman, 1943)
Towards Parallel Computing – The 5 ERAs
More on the historical perspective
Parallel computing has been here since the early days of computing.
Traditionally: custom HW, custom SW, high prices
The “doom” of Moore’s law:
- custom HW has a hard time catching up with the commodity processors
Current trend: use commodity HW components, standardize SW
 Parallelism sneaking into commodity computers:
• Instruction Level Parallelism - wide issue, pipelining, OOO
• Data Level Parallelism – 3DNow, Altivec
• Thread Level Parallelism – Hyper-threading in Pentium IV
 Transistor budgets allow for multiple processor cores on a chip.
More on historical perspective (cont.)
Most applications would benefit from being parallelized and
executed on a parallel computer.
• even PC applications, especially the most demanding ones – games,
multimedia
Chicken & Egg Problem:
1. Why build parallel computers when the applications are sequential?
2. Why parallelize applications when there are no parallel commodity
computers?
Answers:
1. What else to do with all those transistors?
2. Applications already are a bit parallel (wide issue, multimedia
instructions, hyper-threading), and this bit is growing.
Core Problems in Distributed Computing
• What types of problems are there?
• An example: Generals’ Problem
Two generals need to coordinate an attack
1. Must agree on time to attack
2. They’ll win only if they attack simultaneously
3. Communicate through messengers
4. Messengers may be killed on their way
Let’s try to solve it for generals g1 and g2
step 1: g1 sends time of attack to g2
Problem: how to ensure g2 received msg?
step 2 (Solution): let g2 ack receipt of msg
Problem: how to ensure g1 received ack
step 3 (Solution): let g1 ack the receipt of the ack…

This problem seems (is!) really impossible to solve!
Consensus
 Two nodes need to agree on a value
 Communicate by messages using an unreliable channel
Agreement is a core problem… agreeing on a number = consensus
 The Consensus problem
 All nodes propose a value
 Some nodes might crash & stop responding
 The algorithm must ensure:
 All correct nodes eventually decide
 Every node decides the same
 Only decide on proposed values
Broadcast
 Atomic Broadcast
 A node broadcasts a message
 If sender correct, all correct nodes deliver msg
 All correct nodes deliver same messages
 Messages delivered in the same order
 Given Atomic broadcast
 Can use it to solve Consensus
 Every node broadcasts its proposal
 Decide on the first received proposal
 Messages received same order
 All nodes will decide the same
 Given Consensus
 Can use it to solve Atomic broadcast
 Atomic Broadcast equivalent to Consensus
