
6CS005 High Performance Computing
Lecture 5: Parallel Programming
Matrix-Vector Multiplication
Pseudocode for serial matrix-vector multiplication

If A = (aij) is an m×n matrix and x is a vector with n components, then y = A*x is a vector with m components, and we can find the ith component of y by forming the dot product of the ith row of A with x:

  y[i] = a[i][0]*x[0] + a[i][1]*x[1] + … + a[i][n-1]*x[n-1]
Parallelization of the matrix-vector
multiplication
• We need to parallelize this by dividing the work among the threads.
• One possibility is to divide the iterations of the outer loop
among the threads.
• If we do this, each thread will compute some of the
components of y.
Parallelising Processing

• For example, suppose that m = 6 and the number of threads, thread_count or n, is three. Then the computation could be divided among the threads as follows:
  – Thread 0: y[0], y[1]
  – Thread 1: y[2], y[3]
  – Thread 2: y[4], y[5]
Contd…

• To compute y[0], thread 0 will need to execute the code

    y[0] = 0.0;
    for (j = 0; j < n; j++)
        y[0] += A[0][j] * x[j];

• Therefore thread 0 will need to access every element of row 0 of A and every element of x.
• More generally, the thread that has been assigned y[i] will need to execute the code

    y[i] = 0.0;
    for (j = 0; j < n; j++)
        y[i] += A[i][j] * x[j];
Parallelising Processing

• Thus, for n threads and m rows:

  – Thread 0
    • start row = 0
    • end row = (m/n) - 1
  – Thread 1
    • start row = (m/n)
    • end row = 2*(m/n) - 1
  – Thread 2
    • start row = 2*(m/n)
    • end row = 3*(m/n) - 1
  – Thread 3
    • start row = 3*(m/n)
    • end row = 4*(m/n) - 1

  – Thread k
    • start row = k * (m/n)
    • end row = ((k + 1) * (m/n)) - 1  /* the very last end row needs to be adjusted if m is not a multiple of n */
Pthread Function for matrix-vector multiplication
4x2 and 2x3 Matrix-Matrix
Multiplication
General Matrix-Matrix Multiplication
Pseudocode for General Matrix-Matrix Multiplication of A × B = C
Race Condition

• A race condition arises in software when a computer program, to operate properly, depends on the sequence or timing of the program's processes or threads.
• Critical race conditions often happen when the processes or threads depend on some shared state.
Race Condition

Critical Sections
Busy Wait

Busy Wait
Mutex
Issues

• Since a thread that is busy-waiting may continually use the CPU, busy-waiting is generally not an ideal solution to the problem of limiting access to a critical section.
• Two better solutions are mutexes and semaphores.
• Mutex is an abbreviation of mutual exclusion, and a mutex is a special type of variable that, together with a couple of special functions, can be used to restrict access to a critical section to a single thread at a time.
• Thus a mutex can be used to guarantee that one thread “excludes” all other threads while it executes the critical section.
• Hence, the mutex guarantees mutually exclusive access to the critical section.

Fig: Run-times (in seconds) of the π programs using n = 10^8 terms on a system with two four-core processors
Barriers and Condition Variables
Barriers

• In shared-memory programming: synchronizing the threads by making sure that they all are at the same point in a program.
• Such a point of synchronization is called a barrier, because no thread can proceed beyond the barrier until all the threads have reached it.
Busy-waiting and a Mutex

• Implementing a barrier using busy-waiting and a mutex is straightforward: we use a shared counter protected by the mutex.
• When the counter indicates that every thread has entered the critical section, threads can leave the busy-wait loop.
Busy-waiting and a Mutex
Problems in the Busy-Waiting and Mutex Implementation

• First barrier
  – Threads use a busy-wait loop (while (counter < thread_count)) to wait for all threads to reach the barrier, so threads consume CPU cycles while waiting.
• Second barrier
– Reusing the same counter variable causes unsynchronized
behavior.
– At the start of the second barrier, counter =
thread_count from the first barrier, making the condition
counter < thread_count immediately false.
– As a result, threads bypass the second barrier, leading to
unsynchronized execution.
Contd…

• Resetting the counter after the first barrier
  – Suppose the counter is reset after all threads exit the first barrier but before they all reach the second barrier.
  – Increments to counter may be lost, so the second barrier's exit condition (counter < thread_count becoming false) may never be satisfied.
  – This causes deadlock, as threads hang in the busy-wait loop of the second barrier.
Solution???

• To resolve these issues:
  – Use separate counters:
    Allocate independent counters for each barrier instance to avoid conflicts.
  – Use condition variables (pthread_cond_wait):
    Replace the busy-waiting loop with condition variables to manage thread synchronization efficiently and avoid race conditions. (Threads block until signaled, avoiding CPU waste.)
Implementing a barrier with semaphores
Contd…

• count_sem: protects the counter and ensures mutual exclusion during its modification.
• barrier_sem: blocks threads at the barrier until all threads have arrived.
• Threads increment the counter under protection by count_sem.
• The last thread resets the counter and releases all waiting threads by calling sem_post(&barrier_sem) multiple times.
• Semaphores prevent busy-waiting, conserving CPU resources.
Contd…

• counter and count_sem can be reused since they are properly reset before threads exit the barrier.
• Reusing barrier_sem can lead to a race condition if threads proceed to the next barrier before all threads have finished decrementing barrier_sem in the first barrier.
Solution???

• Separate barrier_sem for each barrier:
  – Allocate a unique barrier_sem for each barrier to prevent interference between barriers.
  – Ensure threads only interact with the barrier_sem specific to their current barrier.
• Synchronization Between Barriers:
– Ensure threads cannot proceed to the next barrier until all
threads have exited the first barrier.
– Use additional synchronization mechanisms (e.g., flags or
condition variables) to guarantee proper sequencing.
Condition Variables

• A somewhat better approach to creating a barrier in Pthreads is provided by condition variables.
• A condition variable is a data object that allows a thread to suspend execution until a certain event or condition occurs.
• When the event or condition occurs, another thread can signal the thread to “wake up.”
• A condition variable is always associated with a mutex.
Condition Variables

Typically, condition variables are used in constructs similar to this pseudocode:
Contd…

• Locking the mutex ensures exclusive access to shared resources.
• If the condition has occurred, the thread notifies other threads that are waiting on this condition variable using pthread_cond_signal or pthread_cond_broadcast.
• The function
    int pthread_cond_signal(pthread_cond_t* cond_var_p);
  will unblock one of the blocked threads, and
    int pthread_cond_broadcast(pthread_cond_t* cond_var_p);
  will unblock all of the blocked threads.
Contd…

• If not, the thread waits and releases the mutex, allowing other threads to update the counter.
Implementing a barrier with condition variables
Contd…

• int counter: tracks the number of threads that have reached the barrier.
• pthread_mutex_t mutex: protects access to the shared counter to prevent race conditions.
• pthread_cond_t cond_var: used for condition-based synchronization (to block and wake threads).
• pthread_mutex_lock(&mutex) locks the mutex to protect shared data.
• The counter is incremented by the thread to indicate it has reached the barrier.
• If the counter equals thread_count, all threads have reached the barrier. The thread resets the counter to 0. It then broadcasts a signal using pthread_cond_broadcast(&cond_var) to wake all threads waiting at the barrier.
• If the thread is not the last to reach the barrier, it waits on the condition variable by calling pthread_cond_wait(&cond_var, &mutex), which releases the mutex while the thread is blocked.
Pthreads Read-Write Locks

• Pthreads read-write locks are a synchronization mechanism used to allow multiple threads to access a shared resource with different levels of concurrency.
• Mutex vs. read-write locks:
  • A mutex allows only one thread to access a resource at any given time.
  • Read-write locks provide more flexibility by allowing multiple threads to read the resource concurrently but ensuring exclusive access for one thread when writing.
How Do Read-Write Locks Work?

• Read Lock (pthread_rwlock_rdlock):
  – Allows multiple threads to simultaneously read the resource.
– If any thread holds a read lock, other threads can also
acquire the read lock.
• Write Lock (pthread_rwlock_wrlock):
– Only one thread can hold a write lock at a time.
– If a thread holds the write lock, no other threads can
acquire either a read or a write lock.
• Unlock (pthread_rwlock_unlock):
– Used to release the read or write lock after the thread
finishes its operations on the shared resource.
Pthreads Read-Write Locks Functions

• int pthread_rwlock_rdlock(pthread_rwlock_t *rwlock_p);
• int pthread_rwlock_tryrdlock(pthread_rwlock_t *rwlock_p);
• int pthread_rwlock_wrlock(pthread_rwlock_t *rwlock_p);
• int pthread_rwlock_trywrlock(pthread_rwlock_t *rwlock_p);
• int pthread_rwlock_unlock(pthread_rwlock_t *rwlock_p);
• int pthread_rwlock_init(pthread_rwlock_t *rwlock_p, const pthread_rwlockattr_t *attr_p);
Thread-Safety

• In shared-memory programming, thread safety refers to the ability of a block of code to be executed by multiple threads simultaneously without causing issues, such as data corruption or incorrect results.
• Example
  – Suppose we want to use multiple threads to “tokenize” a file that consists of ordinary English text.
  – The tokens are just contiguous sequences of characters separated from the rest of the text by white-space: a space, a tab, or a newline.
Simple approach

• Divide the input file into lines of text and assign the
lines to the threads in a round-robin fashion.
• The first line goes to thread 0, the second goes to thread 1, …, the nth goes to thread n-1, the (n+1)st goes back to thread 0, etc.
• We can serialize access to the lines of input using
semaphores.
• After a thread has read a single line of input, it can
tokenize the line using the C standard library
"strtok()" function.
The strtok() function

• The first time it’s called, the string argument should be the text to be tokenized.
• For subsequent calls, the first argument should be NULL.
• The idea is that in the first call, strtok caches a pointer to string, and for subsequent calls it returns successive tokens taken from the cached copy.
Running with one thread

• It correctly tokenizes the input stream.
  – Pease porridge hot.
  – Pease porridge cold.
  – Pease porridge in the pot
  – Nine days old.
Running with two threads
What happened?

• strtok() caches the input line by declaring a variable to have static storage class.
• This causes the value stored in this variable to persist from one call to the next.
• Unfortunately for us, this cached string is a global.
• Thus, thread 0’s call to strtok() with the third line of the input has apparently overwritten the contents of thread 1’s call with the second line.
• So the strtok() function is not thread-safe.
• If multiple threads call it simultaneously, the output may not be correct.
Other unsafe C library functions

• It’s not uncommon for C library functions to not be thread-safe.
• The random number generator random in stdlib.h.
• The time conversion function localtime in time.h.
Solution??

• To resolve this issue, use a thread-safe version of strtok, such as strtok_r (reentrant strtok), which allows each thread to maintain its own internal state during tokenization.
• Each thread would pass its own saveptr to strtok_r, allowing them to tokenize independently without affecting each other.
Summary

• A thread in shared-memory programming is analogous to a process in distributed-memory programming.
• However, a thread is often lighter-weight than a full-fledged process.
• In Pthreads programs, all the threads have access to global variables, while local variables usually are private to the thread running the function.
• When multiple threads attempt to access a shared resource, at least one of the accesses is an update, and the accesses can result in an error, we have a race condition. The shared resource might be a shared variable or a shared file.
Summary

• A critical section is a block of code that updates a shared resource that can only be updated by one thread at a time.
• So the execution of code in a critical section should, effectively, be executed as serial code.
• Busy-waiting can be used to avoid conflicting access to critical sections with a flag variable and a while-loop with an empty body.
• It can be very wasteful of CPU cycles.
• It can also be unreliable if compiler optimization is turned on.
Summary

• A mutex can be used to avoid conflicting access to critical sections as well.
• Think of it as a lock on a critical section, since mutexes arrange for mutually exclusive access to a critical section.
• A semaphore is the third way to avoid conflicting access to critical sections.
• It is an unsigned int together with two operations: sem_wait and sem_post.
• Semaphores are more powerful than mutexes since they can be initialized to any nonnegative value.
Summary

• A barrier is a point in a program at which the threads block until all of the threads have reached it.
• A read-write lock is used when it’s safe for multiple threads to simultaneously read a data structure, but if a thread needs to modify or write to the data structure, then only that thread can access the data structure during the modification.
• Some C functions cache data between calls by declaring variables to be static, causing errors when multiple threads call the function.
• This type of function is not thread-safe.
End of Lecture 5
