Pap 3 Shared Memory Algos

The document discusses parallel algorithms and the PRAM model of computation. It describes parallel algorithms for problems like list ranking, pointer jumping, and parallel prefix sums. It also analyzes the performance of parallel prefix sum algorithms and different PRAM models.


Parallel Algorithms and Programming

Parallel algorithms in shared memory
Thomas Ropars

Email: thomas.ropars@univ-grenoble-alpes.fr

Website: tropars.github.io

References
The content of this lecture is inspired by:
Parallel algorithms (Chapter 1) by H. Casanova, Y. Robert, A. Legrand.
A survey of parallel algorithms for shared-memory machines by R. Karp, V. Ramachandran.
Parallel Algorithms by G. Blelloch and B. Maggs.
Data Parallel Thinking by K. Fatahalian.

Outline
The PRAM model

Some shared-memory algorithms

Analysis of PRAM models

Need for a model
A parallel algorithm
Defines multiple operations to be executed in each step
Includes communication/coordination between the processing units

The problem
A wide variety of parallel architectures
Different numbers of processing units
Multiple network topologies

How to reason about parallel algorithms?

How to avoid designing algorithms that would work only for one
architecture?

A model can be used to abstract away some of the complexity

Should still capture enough details to predict with a reasonable
accuracy how the algorithm will perform
A model for shared memory computation
The PRAM model

Parallel RAM
A shared central memory
A set of processing units (PUs)
Any PU can access any memory location in one unit of time
The number of PUs and the size of the memory are unbounded

Details about the PRAM model
Lock-step execution
A 3-phase cycle:
1. Read memory cells
2. Run local computations
3. Write to the shared memory
All PUs execute these steps synchronously
No need for explicit synchronization

About concurrent accesses to memory: 3 PRAM models

CREW: Concurrent Read, Exclusive Write
CRCW: Concurrent Read, Concurrent Write
Semantics of concurrent writes?
EREW: Exclusive Read, Exclusive Write

About the CRCW model
Semantics of concurrent writes:
Arbitrary mode: Select one value from the concurrent writes
Priority mode: Select the value of the PU with the lowest index
Fusion mode: A commutative and associative operation is applied to the values (logical OR, AND, sum, maximum, etc.)
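
The three modes can be illustrated with a small sketch. The function name `resolve_concurrent_writes` and the representation of writes as `(pu_index, value)` pairs are assumptions made for this example only; they are not part of the PRAM model or of the lecture.

```python
from functools import reduce
import operator


def resolve_concurrent_writes(writes, mode, combine=operator.add):
    """Return the value stored when several PUs write to one cell in the same step.

    writes: non-empty list of (pu_index, value) pairs (hypothetical representation).
    combine: commutative, associative operation used in fusion mode.
    """
    if not writes:
        raise ValueError("no write to resolve")
    if mode == "arbitrary":
        # Any written value is a valid outcome; this sketch simply keeps the first.
        return writes[0][1]
    if mode == "priority":
        # The PU with the lowest index wins.
        return min(writes, key=lambda w: w[0])[1]
    if mode == "fusion":
        # All written values are combined into a single one.
        return reduce(combine, (value for _, value in writes))
    raise ValueError(f"unknown mode: {mode}")


writes = [(3, 10), (1, 7), (2, 5)]
print(resolve_concurrent_writes(writes, "priority"))     # value written by PU 1
print(resolve_concurrent_writes(writes, "fusion"))       # sum of all values
print(resolve_concurrent_writes(writes, "fusion", max))  # maximum of all values
```

An algorithm written for arbitrary mode has to be correct whichever value is kept, so it should not rely on the choice made in this sketch.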

How powerful are the different models?

CRCW > CREW > EREW

A model is more powerful than another if there is at least one problem for which it allows implementing a strictly faster solution with the same number of PUs

Some shared-memory algorithms
List ranking
Description of the problem
A linked list of n objects
Doubly-linked list
We want to compute the distance of each element to the end of the list

The sequential solution

Iterate through the list from the end to the beginning
Assign each element a distance from the last element while iterating
This solution has a complexity (execution time) in O(n)

Can we do better with a parallel algorithm?

List ranking

A solution based on pointer jumping

# the list is stored in array next
# the distances are stored in array d
Ranking():
    forall i in parallel:  # initialization
        if next[i] is None:
            d[i] = 0
        else:
            d[i] = 1
    while there exists a node i such that next[i] != None:
        forall i in parallel:
            if next[i] != None:
                d[i] = d[i] + d[next[i]]
                next[i] = next[next[i]]  # pointer jumping

This solution has an execution time in O(log n)

Note that the solution requires n PUs

The parallel version performs more work than the sequential version of the algorithm: O(n log n) operations in total instead of O(n)

Credit: Parallel algorithms, Casanova, Robert, Legrand.
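
As a rough illustration, the algorithm can be simulated sequentially in Python. This is only a sketch: the name `list_ranking` and the step counter are additions for the example, and the lock-step behaviour of the PRAM is emulated by reading from the old arrays and writing into fresh copies.

```python
def list_ranking(next_):
    """Simulate pointer-jumping list ranking.

    next_[i] is the index of the successor of element i, or None for the last one.
    Returns (d, steps): distances to the end of the list and number of parallel steps.
    """
    n = len(next_)
    nxt = list(next_)
    d = [0 if nxt[i] is None else 1 for i in range(n)]  # initialization
    steps = 0
    while any(p is not None for p in nxt):
        # All PUs read the old arrays, then all PUs write (lock-step emulation).
        new_d = list(d)
        new_nxt = list(nxt)
        for i in range(n):
            if nxt[i] is not None:
                new_d[i] = d[i] + d[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]  # pointer jumping
        d, nxt = new_d, new_nxt
        steps += 1
    return d, steps


# List 0 -> 1 -> 2 -> 3 -> 4
distances, steps = list_ranking([1, 2, 3, 4, None])
print(distances, steps)
```

For this 5-element list the loop runs 3 times, which matches ceiling(log2(5)).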

Comments on the previous algorithm
Implementing pointer jumping

forall i in parallel:
    next[i] = next[next[i]]

In practice, if the PUs do not all execute synchronously, next[next[i]] may be overwritten by another PU before it is read here.

To make the algorithm safe in practice, we would have to implement (with a synchronization barrier between the two loops):

forall i in parallel:
    temp[i] = next[next[i]]
forall i in parallel:
    next[i] = temp[i]
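
The difference between the two versions can be observed with a small sequential simulation. In this sketch, the hypothetical `jump_in_place` lets the PUs run one after another in a chosen order (one possible asynchronous schedule), while `jump_buffered` follows the two-loop version above. Both add a None check that the snippet above leaves implicit.

```python
def jump_in_place(nxt, order):
    """Unsafe variant: PUs update next[] one after another, in the given order."""
    nxt = list(nxt)
    for i in order:
        if nxt[i] is not None:
            nxt[i] = nxt[nxt[i]]  # may read a cell already modified by another PU
    return nxt


def jump_buffered(nxt):
    """Safe variant: read everything into temp first, then write everything."""
    temp = [None if p is None else nxt[p] for p in nxt]  # first loop
    return list(temp)                                    # second loop


chain = [1, 2, 3, None]  # 0 -> 1 -> 2 -> 3
print(jump_buffered(chain))               # one correct pointer-jumping step
print(jump_in_place(chain, [3, 2, 1, 0]))  # schedule where late PUs see new values
```

With the schedule shown, the in-place version collapses every pointer to None after a single step, whereas the buffered version yields the expected [2, 3, None, None].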

Comments on the previous algorithm
About the termination test
Note that the test in the while loop can be done in constant time only in the CRCW model
The problem is that all PUs must share the result of their local test (next[i] != None)
In a CW model, all PUs can write to the same variable and a fusion operation (e.g., logical OR) can be used
In an EW model, the results of the tests can only be aggregated two by two, leading to a solution with a complexity in O(log n) for this operation
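
The two-by-two aggregation can be sketched as follows. The function name `any_by_pairwise_or` and the way pairs are formed are assumptions for this example; the point is only that each round touches disjoint cells and that the number of rounds grows as log n.

```python
def any_by_pairwise_or(flags):
    """Combine boolean flags two by two.

    Returns (result, rounds), where rounds is the number of parallel rounds used.
    """
    values = list(flags)
    if not values:
        return False, 0
    rounds = 0
    while len(values) > 1:
        # One parallel round: PU j combines cells 2j and 2j+1.
        # No cell is read or written by more than one PU.
        paired = [values[j] or values[j + 1] for j in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])  # the unpaired cell is carried over
        values = paired
        rounds += 1
    return values[0], rounds


print(any_by_pairwise_or([False] * 7 + [True]))
print(any_by_pairwise_or([False] * 5))
```

Both calls need 3 rounds, against a single step when all PUs may write to the same cell in fusion mode.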

Point to root
Description of the problem
A tree data structure
Each node should get a pointer to the root

Use of pointer jumping

PointToRoot(P):
    for k in 1..ceiling(log(sizeof(P))):
        forall i in parallel:
            P[i] = P[P[i]]

We assume that we know sizeof(P) and that the root points to itself (P[root] = root)
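
A sequential simulation of this procedure is sketched below. It assumes that P[i] holds the index of the parent of node i, that the root is its own parent, and that the logarithm is in base 2; the name `point_to_root` is chosen for the example.

```python
import math


def point_to_root(parent):
    """Make every node point to the root using pointer jumping.

    parent[i] is the parent index of node i; the root is assumed to satisfy
    parent[root] == root.
    """
    p = list(parent)
    n = len(p)
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(rounds):
        # Lock-step emulation: the new array is built from the old one only.
        p = [p[p[i]] for i in range(n)]
    return p


# Chain of depth 4: 4 -> 3 -> 2 -> 1 -> 0 (root)
print(point_to_root([0, 0, 1, 2, 3]))
# Small tree rooted at node 0
print(point_to_root([0, 0, 0, 1, 1, 2]))
```

Each round doubles the distance covered by a pointer, so ceiling(log2(n)) rounds are enough even for a tree that is a single chain.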

Divide and conquer
Split the problems into sub-problems that can be solved independently
Merge the solutions

Example: Mergesort

Mergesort(A):
    if sizeof(A) <= 1:
        return A
    else:
        Do in parallel:
            L = Mergesort(A[0 .. sizeof(A)/2 - 1])
            R = Mergesort(A[sizeof(A)/2 .. sizeof(A) - 1])
        return Merge(L, R)

It is usually important to parallelize both the divide and the merge steps:

In the algorithm above, the merge step is going to be the bottleneck
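
Why the merge is the bottleneck can be seen from the work and depth recurrences. The derivation below is a sketch that assumes a sequential merge costing O(n) and an unbounded number of PUs for the recursive calls.

```latex
\begin{align*}
W(n) &= 2\,W(n/2) + O(n) = O(n \log n) \\
D(n) &= D(n/2) + O(n) = O(n) + O(n/2) + O(n/4) + \dots = O(n)
\end{align*}
```

Under these assumptions the depth is dominated by the top-level merge, so the execution time stays in O(n) however many PUs are available.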

Analysis of PRAM models

Comparison of PRAM models
CRCW vs CREW
To compare CRCW and CREW, we consider a reduce operation over n elements with an associative operation.
Example: the sum of n elements

With CRCW: O(1) steps (all PUs write to the same cell in fusion mode)

With CREW: O(log n) steps (the values are combined two by two)
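
The O(log n) bound without concurrent writes can be illustrated by an in-place tree reduction. This is a sketch under the assumption that one PU is available per pair of cells; `tree_reduce_sum` is a name chosen for the example.

```python
def tree_reduce_sum(values):
    """Sum the values with a tree-shaped reduction.

    Returns (total, steps), where steps is the number of synchronous steps.
    """
    a = list(values)
    n = len(a)
    steps = 0
    stride = 1
    while stride < n:
        # One synchronous step: the PU in charge of cell i adds a[i + stride] into a[i].
        # The cells touched by different PUs are disjoint, so no write is concurrent.
        for i in range(0, n - stride, 2 * stride):
            a[i] = a[i] + a[i + stride]
        stride *= 2
        steps += 1
    return (a[0] if n else 0), steps


print(tree_reduce_sum(range(1, 9)))   # 8 elements
print(tree_reduce_sum([1, 2, 3, 4, 5]))
```

The number of steps is ceiling(log2(n)): 3 for both inputs shown.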

Comparison of PRAM models
CREW vs EREW
To compare CREW and EREW, we consider the problem of determining whether an element e belongs to a set $(e_1, \dots, e_n)$.
Solution with CREW:
A boolean res is initialized to false and n PUs are used
PU k runs the test ($e_k$ == e)
If one PU finds e, it sets res to true (the elements of a set are distinct, so at most one PU writes)
Solution with EREW:
Same algorithm, except that e cannot be read simultaneously by multiple PUs
n copies of e should be created (broadcast)

With CREW: O(1) steps

With EREW: O(log n) steps
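
One way to create the n copies under exclusive reads is to double the number of copies at each round. The sketch below assumes this doubling scheme; the name `broadcast_by_doubling` is hypothetical.

```python
def broadcast_by_doubling(value, n):
    """Create n copies of value (n >= 1) without any concurrent read.

    Returns (copies, rounds), where rounds is the number of parallel rounds.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    copies = [None] * n
    copies[0] = value
    filled = 1
    rounds = 0
    while filled < n:
        # One parallel round: PU j copies cell j into cell filled + j.
        # Each source cell is read once and each target cell is written once.
        count = min(filled, n - filled)
        for j in range(count):
            copies[filled + j] = copies[j]
        filled += count
        rounds += 1
    return copies, rounds


print(broadcast_by_doubling("e", 8))
print(broadcast_by_doubling("e", 5)[1])
```

The number of rounds is ceiling(log2(n)), which is where the O(log n) cost of the EREW solution comes from.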

Limits of the PRAM model
Unrealistic memory model
Constant-time access to every memory location

Synchronous execution
Removes some flexibility

Unlimited amount of resources

Might not allow devising an algorithm that works well on a real system

Study of Parallel scans

Scans (Prefix sums)
Description of the problem
Inputs:
A sequence of elements $x_1, x_2, \dots, x_n$
An associative operation *
Output:
A sequence of elements $y_1, y_2, \dots, y_n$ such that $y_k = x_1 * x_2 * \dots * x_k$

Solution applying the pointer jumping technique

Scan(L):
    forall i in parallel:  # initialization
        y[i] = x[i]

    for k in 1..ceiling(log(sizeof(L))):
        forall i in parallel:
            if next[i] != None:
                y[next[i]] = y[i] * y[next[i]]
                next[i] = next[next[i]]
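
A sequential simulation of this scan is sketched below. It assumes that the elements are stored in an array in list order (element i is followed by element i + 1), that the logarithm is in base 2, and it emulates lock-step execution by writing into fresh copies; `scan_pointer_jumping` is a name chosen for the example.

```python
import math
import operator


def scan_pointer_jumping(x, op=operator.add):
    """Inclusive scan of x with the associative operation op, by pointer jumping."""
    n = len(x)
    y = list(x)  # initialization
    # Assumption: element i is followed by element i + 1; the last one has no successor.
    nxt = [i + 1 if i + 1 < n else None for i in range(n)]
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(rounds):
        new_y = list(y)
        new_nxt = list(nxt)
        for i in range(n):
            if nxt[i] is not None:
                # The operand order matters when op is not commutative.
                new_y[nxt[i]] = op(y[i], y[nxt[i]])
                new_nxt[i] = nxt[nxt[i]]
        y, nxt = new_y, new_nxt
    return y


print(scan_pointer_jumping([1, 2, 3, 4, 5]))
print(scan_pointer_jumping(["a", "b", "c"]))  # string concatenation is associative
```

In every round each cell is the successor of at most one element, so no two PUs write to the same cell.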

Scans (Prefix sums)
Performance of this algorithm
Work:

W(n) = O(n log n)

Depth:

D(n) = O(log n)

If we do not have n processing units in practice, the large amount of work W(n) can be an issue for performance

For instance, what would be a good algorithm on two processing units?

Parallel scan with 2 processing units
Solution

Scan(L):
    # input: x; output: y
    # first phase: each PU scans one half sequentially, writing into y
    half = sizeof(L)/2
    for i in 0..1 in parallel:
        SequentialScan(x[half*i .. half*(i+1)-1])

    # second phase: add the total of the first half to the second half
    base = y[half-1]
    quarter = half / 2
    for i in 0..1 in parallel:
        add base to elems in y[half+quarter*i .. half+quarter*(i+1)-1]

Performance of this algorithm

Work: W(n) = O(n)
Depth: D(n) = O(n)
It will perform better in practice due to the reduced amount of work
Improves the locality of the data accesses (good for prefetchers)
Credit: Lecture -- Data parallel thinking, Fatahalian.
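
The two phases can be sketched in Python as follows. This is a sequential simulation for the sum only: the two "PUs" are emulated by two passes over disjoint index ranges, the name `two_pu_scan` is an assumption, and the handling of odd lengths is a choice made for this example.

```python
from itertools import accumulate


def two_pu_scan(x):
    """Inclusive prefix sum of x organized as the two-phase, two-PU algorithm."""
    n = len(x)
    half = n // 2
    # First phase: each PU scans its own half sequentially.
    y = list(accumulate(x[:half])) + list(accumulate(x[half:]))
    # Second phase: the total of the first half is added to the second half,
    # and each PU updates one part of that second half.
    if half > 0:
        base = y[half - 1]
        mid = half + (n - half) // 2
        for lo, hi in ((half, mid), (mid, n)):  # ranges handled by PU 0 and PU 1
            for j in range(lo, hi):
                y[j] += base
    return y


print(two_pu_scan([1, 2, 3, 4, 5, 6, 7]))
```

Each element is visited at most twice, so the total work stays linear in n, in line with the W(n) = O(n) stated above.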
Performance comparison
Assumptions for the computation
Read 2 elements, compute the sum, and write back the result in 1 step
Array of 1000 elements

Execution time as a function of the number of PUs

The algorithm with a larger depth and less work per iteration performs better up to 16 PUs
