Parallel Hardware
Some computer systems can do many things at the same time. This is called
parallel hardware.
In some systems (such as pipelining or multiple issue), the parallelism happens
inside the CPU and the programmer doesn't need to do anything special to use
it.
So, these are not counted as "true" parallel systems here.
We only call it parallel hardware if the programmer can or must change
the code to make the system run things in parallel.
Types of Parallel Computers
There are two main ways to group parallel computers:
1. Flynn’s Taxonomy
This method groups computers by how many instructions and data streams
they can handle at once:
SISD (Single Instruction, Single Data): A normal computer – it executes one
instruction on one piece of data at a time.
SIMD (Single Instruction, Multiple Data): The same instruction runs on many
pieces of data at once. (Used in GPUs.)
MIMD (Multiple Instruction, Multiple Data): Many processors run different
instructions on different data. (Used in multi-core CPUs and clusters.)
2. Memory Access Model
This method is about how the processors access memory.
Shared Memory:
All processors share the same memory. They can talk to each other
using that shared memory.
Distributed Memory:
Each processor has its own memory. To share data, they must send
messages to each other.
SIMD Systems (Single Instruction, Multiple Data)
SIMD is a type of parallel computing.
It means: one instruction is applied to many pieces of data at the same
time.
It is useful when you're doing the same thing repeatedly on different data
values.
Working of SIMD
A single control unit sends the same instruction to multiple datapaths.
Each datapath applies the instruction to its own data.
If the instruction can’t be used on some data, that datapath sits idle.
Example:
Suppose you want to add two arrays element-wise:
for (i = 0; i < n; i++)
x[i] += y[i];
In SIMD, if the system has n datapaths, all additions can be done at once.
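On real hardware the number of SIMD lanes is fixed, so longer arrays are
processed in chunks. A minimal sketch, assuming an x86 CPU with AVX (4 doubles
per register) and an n that is a multiple of 4:

#include <immintrin.h>   /* x86 AVX intrinsics */

/* Element-wise x[i] += y[i], four doubles per instruction.
   A real version would finish any leftover elements with a scalar loop. */
void add_arrays(double* x, const double* y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);             /* load 4 doubles of x */
        __m256d vy = _mm256_loadu_pd(&y[i]);             /* load 4 doubles of y */
        _mm256_storeu_pd(&x[i], _mm256_add_pd(vx, vy));  /* one add = 4 sums    */
    }
}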
Vector Processors
A vector processor handles vectors (arrays of data) instead of single
numbers.
It’s another way to do SIMD operations efficiently.
Features:
1. Vector Registers:
Can store a whole array (e.g. 4 to 256 numbers) in one register.
2. Vector Instructions:
Do operations like addition/multiplication on all elements in the vector
at once.
3. Pipelined Units:
Helps process multiple elements very quickly one after another.
Benefits:
Faster for large loops and big arrays.
Very efficient memory usage.
Compilers can automatically convert code to vector form (vectorization).
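As an example of the last point, a plain C loop like the sketch below is
usually auto-vectorized when compiled with optimization (e.g., gcc -O3); the
restrict qualifier simply tells the compiler the arrays do not overlap, which
makes vectorization easier to prove safe.

/* Compilers typically turn this scalar loop into vector instructions when
   optimization is enabled; no source changes are strictly required. */
void scale_add(double* restrict x, const double* restrict y,
               double a, int n) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i] + y[i];
}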
GPUs (Graphics Processing Units)
Originally built to render graphics.
Now used for general computing (AI, simulations, etc.)
Works like SIMD but more flexible.
Features:
1. Hardware Multithreading:
Can switch between many threads, so when one thread stalls (e.g., waiting on
memory) another keeps the hardware busy.
2. SIMD + MIMD (combined):
o Within one core → uses SIMD (same instruction on many data items).
o Across multiple cores → uses MIMD (different instructions on
different cores).
3. Memory Use:
Can use shared or distributed memory, or both.
Benefits:
Very fast for image processing, AI, and scientific computing.
Can handle hundreds or thousands of threads.
Modern GPUs are powerful and widely used.
Distributed-Memory Interconnects
In distributed-memory systems, each processor has its own private memory,
and processors communicate through an interconnection network.
(a) Crossbar Layout
We have 4 processors (P1–P4) and 4 memory modules (M1–M4).
Each processor is connected to every memory module using a grid of
switches (the small circles).
This means any processor can access any memory module directly.
(b) Internal Switches
Two types of switch configurations:
1. Connected (i) → Allows data to pass (like a “green light”).
2. Not connected (ii) → Blocks the path (like a “red light”).
These switches control traffic so processors don’t interfere with each
other.
Example
(a) Ring
Processors (P1, P2, P3) are connected in a circle (ring).
Each processor is connected to two neighbors.
Data travels in one direction (or sometimes both directions).
If one link fails, the whole ring may break.
Example: Think of it like a circular train track, where each station (processor)
connects to the next one.
(b) Toroidal Mesh
Processors are arranged in a two-dimensional grid whose rows and columns wrap
around, so each processor is connected to four neighbors.
Ring Bisection
Here, the ring network is divided into two halves (A-side and B-side).
(a) Only 2 Communications
With this split, only two messages can cross between A and B at the same time.
This creates a bottleneck.
(b) 4 Simultaneous Communications
With a different split, four messages can be exchanged at once.
The bisection width measures the worst case: the minimum number of links that
must be removed to split the network into two equal halves. For a ring this
is 2.
A bisection of a toroidal mesh
The network is divided into two halves (left side vs. right side).
Each square = a processor.
Connections cross the middle line, linking left half to right half.
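For reference, the usual bisection-width counts for these networks: a ring
with p processors has bisection width 2, and a square toroidal mesh with p
processors has bisection width 2√p. For example, with p = 16 the mesh cut is
crossed by 2 × √16 = 8 links.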
1. Fully Connected Network (Fig. 2.11)
Each processor is directly linked to every other processor.
Bisection Width: Very high (p²/4).
Advantage: Fastest communication, best possible.
Disadvantage: Too costly and impractical for many processors.
2. Hypercube (Fig. 2.12)
a) 1-D Hypercube
Just 2 processors connected by a link.
b) 2-D Hypercube
4 processors arranged like a square.
Each connected to 2 others.
c) 3-D Hypercube
8 processors arranged like a cube.
Each processor connects to 3 others.
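In general, a d-dimensional hypercube has p = 2^d processors, each connected
to d = log₂(p) neighbors, and its bisection width is p/2. For the 3-D cube
above: p = 8, each processor has 3 links, and the bisection width is 4.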
Indirect Network
In direct networks (like mesh, ring, hypercube), processors are directly
connected to each other.
In an indirect network, processors do not connect directly to every other
processor.
Instead, they send data through a switching network (a kind of “middle
box” or router).
Omega Network
The Omega Network is a type of multistage interconnection network.
Key Features:
Uses log₂(N) stages (if N = number of processors).
o Example: For 8 processors, it needs 3 stages (since log₂(8) = 3).
Each stage has small 2×2 switches (simple, low cost).
The path is decided by the binary address of the destination.
Squares on left = Processors (Inputs).
Circles = Switches (2×2).
Right side = Outputs (Memory modules).
Data flows step by step → Switch → Switch → Switch → Output.
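A quick count, assuming p = 8 processors: there are log₂(8) = 3 stages, each
stage contains p/2 = 4 of the 2×2 switches, so the whole network needs
3 × 4 = 12 switches. A full 8 × 8 crossbar would need 64 switch points, which
is why the Omega network is cheaper.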
Formula for the cost of sending a message:
If
Latency = l seconds
Bandwidth = b bytes/second
Message size = n bytes
Then:
Message transmission time = l + n / b seconds
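A small worked example with made-up numbers: if l = 2 × 10⁻³ seconds (2 ms),
b = 10⁹ bytes/second, and n = 10⁶ bytes, then the message takes
2 × 10⁻³ + 10⁶ / 10⁹ = 2 × 10⁻³ + 1 × 10⁻³ = 3 × 10⁻³ seconds (3 ms).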
Cache coherence
Directory-Based Cache Coherence
Problem
Many cores = many caches.
If one core changes a variable, other cores might still use the old value in
their cache.
We need a way to keep all caches in sync (cache coherence).
Snooping method (old way)
Every time a variable changes, all cores are told (broadcast).
Works for small systems but too slow when there are many cores.
Directory method (new way)
A directory keeps a list of which cores have a copy of each variable
(memory block).
When one core updates the variable:
1. It asks the directory.
2. The directory checks who has the copy.
3. Only those cores are told to update or delete their copy.
No need to tell everyone — only the ones that need it.
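A minimal sketch of what one directory entry might look like, assuming a
simple bit-vector scheme (the struct and field names here are made up for
illustration, not taken from any particular machine):

#include <stdint.h>
#include <stdbool.h>

/* One entry per memory block: the bit vector records which cores currently
   hold a cached copy, so a write only has to notify those cores. */
typedef struct {
    uint64_t sharers;   /* bit i set => core i has a copy of this block */
    bool     dirty;     /* true => some core holds a modified copy      */
    int      owner;     /* core that holds the modified copy (if dirty) */
} directory_entry;

/* Cores whose copies must be invalidated when `writer` updates the block. */
static inline uint64_t cores_to_invalidate(directory_entry e, int writer) {
    return e.sharers & ~((uint64_t)1 << writer);
}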
False Sharing
CPUs don’t store single variables in cache.
They load cache lines (blocks of memory, e.g., 64 bytes).
When any variable in a cache line is accessed, the whole line is loaded into
that core's cache.
Example
Array y[0..7], each element is a double (8 bytes).
Cache line = 64 bytes → all 8 elements (y[0] to y[7]) fit in one cache line.
Now:
Core 0 updates y[0], y[1], y[2], y[3].
Core 1 updates y[4], y[5], y[6], y[7].
The two cores never touch the same element, but because all eight elements sit
in the same cache line, every write by one core invalidates the other core's
copy of the line. The line keeps bouncing between the caches and the program
slows down even though no data is really shared – this is false sharing.
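A minimal sketch of this situation, assuming a 64-byte cache line and POSIX
threads (the iteration count is arbitrary); padding the array so that each
core's elements sit in their own cache line would remove the slowdown.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000

double y[8];   /* 8 doubles = 64 bytes, typically one cache line */

void* work(void* arg) {
    int first = *(int*)arg;                 /* 0 for one thread, 4 for the other */
    for (long i = 0; i < ITERS; i++)
        for (int j = first; j < first + 4; j++)
            y[j] += 1.0;                    /* repeated writes to the shared line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int a = 0, b = 4;
    pthread_create(&t0, NULL, work, &a);    /* "core 0" works on y[0..3] */
    pthread_create(&t1, NULL, work, &b);    /* "core 1" works on y[4..7] */
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("y[0] = %f, y[4] = %f\n", y[0], y[4]);
    return 0;
}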
Shared-memory vs. Distributed-memory
1. Shared-memory systems
All processors share the same memory.
Processors communicate by reading/writing shared variables.
Easy for programmers (just use common data).
Problem:
o Hard to scale → when many processors share the same bus or
interconnect, they contend for it and slow each other down.
o Expensive hardware (e.g., large crossbars).
Best for: small to medium systems (few processors).
2. Distributed-memory systems
Each processor has its own private memory.
Processors communicate by sending messages over a network (like
hypercube or toroidal mesh).
Harder for programmers (must manage message passing).
Advantage:
o Can scale to thousands of processors.
o Cheaper interconnects.
Best for: very large problems with lots of data and computation.
Parallel Software
Nowadays parallel hardware is everywhere: desktops, servers, laptops, and
phones all use multicore processors.
The hardware is ready (many cores are available), but most software is still
written for a single core – that is what has to change.
To get a speedup, software must be written to use multiple
cores/processors.
So programmers need to learn:
o Shared-memory programming.
o Distributed-memory programming.
o MIMD (Multiple Instruction, Multiple Data).
o SIMD (Single Instruction, Multiple Data).
Caveats
1. SPMD = Single Program, Multiple Data.
Instead of running different programs on each core, we run one program
on all cores.
The program behaves differently depending on the thread/process ID.
2. Data Parallelism with SPMD
It is easy to divide an array among processes/threads.
Each thread/process runs the same function on its own block of the array, so
different parts of the data are processed at the same time (see the sketch
below).
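A minimal sketch of this block partitioning in C (the names my_rank,
thread_count, and add_my_block are placeholders for whatever the real program
uses; n is assumed to be divisible by thread_count):

/* Each thread/process calls this with its own rank.  Every rank runs the
   same code but only touches its own block of the arrays. */
void add_my_block(double x[], const double y[], int n,
                  int my_rank, int thread_count) {
    int block    = n / thread_count;
    int my_first = my_rank * block;
    int my_last  = my_first + block;
    for (int i = my_first; i < my_last; i++)
        x[i] += y[i];   /* same program, different data on each rank */
}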
Shared-memory
1. Shared vs Private Variables
Shared variables → can be read/written by any thread. (Used for
communication).
Private variables → belong to one thread only.
2. Two Thread Paradigms
a. Dynamic Threads
There is one master thread and a pool of worker threads.
Workflow:
1. Master waits for a request (e.g., from user/network).
2. When work arrives → it creates (forks) a worker thread.
3. Worker thread does the task.
4. After finishing, worker joins back to master and terminates.
Advantage: Saves resources (threads exist only while working).
Disadvantage: Forking and joining threads repeatedly takes extra time.
b. Static Threads
Master thread starts all threads at the beginning.
Threads run until all work is finished.
After the work → they join back, cleanup happens, and then the program ends
(see the sketch below).
Advantage: Faster (no repeated forking and joining).
Disadvantage: Wastes resources if some threads are idle.
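A minimal static-thread sketch with POSIX threads: the master (main) thread
forks all workers at the start and joins them only when the work is finished.

#include <pthread.h>
#include <stdio.h>

#define THREAD_COUNT 4

void* worker(void* rank) {
    long my_rank = (long) rank;   /* private to this thread; globals would be shared */
    printf("Thread %ld is working\n", my_rank);   /* the real work goes here */
    return NULL;
}

int main(void) {
    pthread_t handles[THREAD_COUNT];
    for (long t = 0; t < THREAD_COUNT; t++)       /* fork all threads up front */
        pthread_create(&handles[t], NULL, worker, (void*) t);
    for (long t = 0; t < THREAD_COUNT; t++)       /* join them at the end */
        pthread_join(handles[t], NULL);
    return 0;
}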
Nondeterminism
Deterministic program → same input always gives the same output.
Nondeterministic program → same input may give different outputs in
different runs.
This usually happens in parallel programs, especially with MIMD systems.
Updating a shared variable (x)
Two threads want to update a shared variable x.
Example :
my_val = Compute_val(my_rank);
x += my_val; // both threads try to update x
Thread 0 loads x (0).
Thread 1 loads x (0).
Thread 0 adds its my_val (say 7).
Thread 0 stores result (7).
Thread 1 adds its my_val (say 19).
Thread 1 stores result (19).
Final x = 19 (Thread 0's update is lost; the correct result would have been
7 + 19 = 26).
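A minimal sketch of one common fix, assuming POSIX threads: a mutex makes the
load-add-store on x atomic, so the two updates can no longer interleave.

#include <pthread.h>

double x = 0.0;
pthread_mutex_t x_mutex = PTHREAD_MUTEX_INITIALIZER;

void update_x(double my_val) {
    pthread_mutex_lock(&x_mutex);     /* only one thread at a time past this point */
    x += my_val;                      /* both updates are now kept (7 + 19 = 26)   */
    pthread_mutex_unlock(&x_mutex);
}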
Thread Safety
Thread safe → works correctly when many threads use it at the same time.
Not thread safe → causes errors if many threads use it together.
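A classic C library example: strtok keeps its position in a hidden static
buffer, so two threads tokenizing different strings at the same time corrupt
each other's state (not thread safe). The POSIX variant strtok_r keeps that
state in a caller-supplied pointer, so each thread has its own (thread safe).

#include <string.h>
#include <stdio.h>

int main(void) {
    char line[] = "a,b,c";
    char* saveptr;                                    /* per-caller state */
    for (char* tok = strtok_r(line, ",", &saveptr);   /* thread-safe version */
         tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr))
        printf("%s\n", tok);
    return 0;
}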
Distributed-Memory
Each processor (core) has its own private memory.
One processor cannot directly access another processor’s memory.
To share information, processors must communicate with each other.
They use different models to communicate:
(a) Message Passing (most common → MPI)
Process A sends a message.
Process B receives that message.
Example:
o Process 1 → "Hello" → Process 0.
o Process 0 → prints: Hello.
Very powerful, but programmer must manage all details.
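A minimal MPI sketch of the exchange described above (process 1 sends a
greeting, process 0 receives and prints it); it assumes an MPI implementation
such as MPICH or Open MPI is installed and the program is started with at
least two processes.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
    int my_rank;
    char msg[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {                               /* Process 1 sends */
        strcpy(msg, "Hello");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {                        /* Process 0 receives */
        MPI_Recv(msg, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s\n", msg);
    }

    MPI_Finalize();
    return 0;
}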
(b) One-Sided Communication (Remote Memory Access)
Only one process does the action (no matching send/receive needed).
Example: Process 0 directly copies data into Process 1’s memory.
Advantage: simpler, less waiting.
(c) PGAS (Partitioned Global Address Space) Languages
Looks like shared memory programming, but actually runs on distributed
systems.
Memory is split into local + shared parts.
Programmer controls which data lives where, to avoid slow “remote”
memory access.