
Advanced Computer Architecture (CMP 511)

Introduction to Parallel Processing

Department of Computer Engineering
Pokhara University
What is Computer Architecture?

Application
↕ (gap too large to bridge in one step)
Physics

In its broadest definition, computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using manufacturing technologies.
Abstractions in Modern Computing Systems

Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture
Microarchitecture
Register-Transfer Level
Gates
Circuits
Devices
Physics

The Instruction Set Architecture, Microarchitecture, and Register-Transfer Level layers together constitute “Computer Architecture”.
Computer Architecture is Constantly Changing

Application Requirements:
• Suggest how to improve architecture
• Provide revenue to fund development

Technology Constraints:
• Restrict what can be done efficiently
• New technologies make new architectures possible

Architecture provides feedback to guide application and technology research directions.
Computers Then…

IAS machine, design directed by John von Neumann. First booted in Princeton, NJ in 1952.
Smithsonian Institution Archives (Smithsonian Image 95-06151)
Computers Now
Sensor nets, cameras, games, set-top boxes, media players, laptops, servers, routers, robots, smartphones, automobiles, supercomputers
Major Technology Generations
(figure: electromechanical relays → vacuum tubes → bipolar → pMOS → nMOS → CMOS) [from Kurzweil]
Sequential Processor Performance
(figure: growth of sequential processor performance; rapid growth in the RISC era, followed by the move to multi-processors)
From Hennessy and Patterson Ed. 5. Image Copyright © 2011, Elsevier Inc. All rights Reserved.
Course Structure
• Recommended Readings
• In-Lecture Questions
• Problem Sets
– Very useful for exam preparation
– Peer Evaluation
• Midterm
• Final Exam
Course Content: Computer Organization
• Basic pipelined processor (~50,000 transistors)

Photo of Berkeley RISC I, © University of California (Berkeley)

Architecture vs. Microarchitecture
“Architecture”/Instruction Set Architecture:
• Programmer visible state (Memory & Register)
• Operations (Instructions and how they work)
• Execution Semantics (interrupts)
• Input/Output
• Data Types/Sizes
Microarchitecture/Organization:
• Tradeoffs on how to implement ISA for some metric
(Speed, Energy, Cost)
• Examples: Pipeline depth, number of pipelines, cache
size, silicon area, peak power, execution ordering, bus
widths, ALU widths
Software Developments
up to 1955: Libraries of numerical routines
- Floating point operations
- Transcendental functions
- Matrix manipulation, equation solvers, . . .

1955-60: High-level languages - Fortran 1956
Operating systems:
- Assemblers, loaders, linkers, compilers
- Accounting programs to keep track of usage and charges

Machines required experienced operators; most users could not be expected to understand these programs.
Compatibility Problem at IBM
By the early 1960s, IBM had 4 incompatible lines of computers!
701 → 7094
650 → 7074
702 → 7080
1401 → 7010
Each system had its own
• Instruction set
• I/O system and secondary storage: magnetic tapes, drums and disks
• Assemblers, compilers, libraries, ...
• Market niche: business, scientific, real time, ...
⇒ IBM 360
Same Architecture, Different Microarchitecture

AMD Phenom X4:
• x86 instruction set
• Quad core
• 125 W
• Decodes 3 instructions/cycle/core
• 64KB L1 I cache, 64KB L1 D cache
• 512KB L2 cache
• Out-of-order
• 2.6 GHz

Intel Atom:
• x86 instruction set
• Single core
• 2 W
• Decodes 2 instructions/cycle/core
• 32KB L1 I cache, 24KB L1 D cache
• 512KB L2 cache
• In-order
• 1.6 GHz
Different Architecture, Different Microarchitecture

AMD Phenom X4:
• x86 instruction set
• Quad core
• 125 W
• Decodes 3 instructions/cycle/core
• 64KB L1 I cache, 64KB L1 D cache
• 512KB L2 cache
• Out-of-order
• 2.6 GHz

IBM POWER7:
• Power instruction set
• Eight core
• 200 W
• Decodes 6 instructions/cycle/core
• 32KB L1 I cache, 32KB L1 D cache
• 256KB L2 cache
• Out-of-order
• 4.25 GHz
Where Do Operands Come From and Where Do Results Go?

(figure: four machine models, each showing how operands reach the ALU and where results are stored. A Stack machine pushes and pops operands at the top of stack (TOS); an Accumulator machine keeps one implicit operand in the accumulator; a Register-Memory machine takes operands from registers or memory; a Register-Register machine operates only on registers, with explicit loads and stores to memory.)

Machine Model Summary

Example: C = A + B

Stack       Accumulator   Register-Memory   Register-Register
Push A      Load A        Load R1, A        Load R1, A
Push B      Add B         Add R3, R1, B     Load R2, B
Add         Store C       Store R3, C       Add R3, R1, R2
Pop C                                       Store R3, C
Real World Instruction Sets

Arch       Type     # Oper  # Mem  Data Size       # Regs  Addr Size  Use
Alpha      Reg-Reg  3       0      64-bit          32      64-bit     Workstation
ARM        Reg-Reg  3       0      32/64-bit       16      32/64-bit  Cell phones, embedded
MIPS       Reg-Reg  3       0      32/64-bit       32      32/64-bit  Workstation, embedded
SPARC      Reg-Reg  3       0      32/64-bit       24-32   32/64-bit  Workstation
TI C6000   Reg-Reg  3       0      32-bit          32      32-bit     DSP
IBM 360    Reg-Mem  2       1      32-bit          16      24/31/64   Mainframe
x86        Reg-Mem  2       1      8/16/32/64-bit  4/8/24  16/32/64   Personal computers
VAX        Mem-Mem  3       3      32-bit          16      32-bit     Minicomputer
Mot. 6800  Accum.   1       1/2    8-bit           0       16-bit     Microcontroller
Why Are We Building Parallel Systems?
• Here we will see what a basic computer system looks like and how we parallelize it to build a large parallel system
• Single-processor performance has been driven by
• Transistor density
• Speed
• ILP
• As speed increases, power increases and heat increases
• How do we exploit the increasing number of transistors, without increasing the clock frequency, and still get better performance?
• Add parallelism
• Multicore
Components of Computer System
• North Bridge Chip
• The Northbridge part of the chipset controls the high-speed channels
• The northbridge handles the high-speed communication between the CPU, memory,
and graphics card
• PCI chip
• Peripheral Component Interconnect (PCI) is a standard interface that connects
hardware devices to a computer's processor bus.
• PCI is a high-speed bus that's found in almost all desktop computers
• South Bridge Chip
• Southbridge controls the lower speed devices.
• The southbridge manages the slower input/output (I/O) operations and connects
devices like hard drives (HD), universal serial bus (USB) devices, and audio interfaces
Components of Computer System (figure)

Generic multiprocessor system with distributed memory (figure)
Generic Parallel Architecture
• Processor nodes
• Several processor nodes are connected through a global interconnect
• The network helps to transmit data between the nodes
• Either bus-based or point-to-point
• Nodes are connected to the network using a network interface
• Input/output (I/O) devices like hard disks are connected to the global interconnect
Parallelism in Architectures
• Scalar Processor
• ADD R1, R2, R3
• Fetch, Decode, Execute
• ILP – Instruction level Parallelism
• TLP – Thread Level Parallelism
• Vector and Array Processor
Instruction Level Parallelism
• Instruction-level parallelism (ILP) is a technique that allows a computer processor to execute multiple instructions at the same time.
• If the instructions are independent, we can exploit this parallelism.
• If the instructions are dependent, we cannot exploit this parallelism, because the result of one instruction is required as an input to another.
ILP – vector addition example
• for (j = 0; j < 1000; j++)
    A[j] = B[j] + C[j];
• Each loop iteration is independent and can run in parallel, which means we have 1000 instructions that can be executed in parallel.
• The processor takes help from the compiler to do this: as programmers we simply write the for loop and do not try to identify the ILP or help the processor exploit it.
• Who helps to exploit ILP? It is the job of the compiler and the microarchitecture, as the sketch below illustrates.
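A minimal C sketch (illustrative; function names are my own) of why this loop exposes ILP: no iteration reads a value written by another, so a compiler may unroll it and a superscalar core can issue the independent additions together:

#include <stddef.h>

#define N 1000

/* Each iteration reads B[j] and C[j] and writes only A[j],
   so no iteration depends on another's result. */
void vector_add(int A[N], const int B[N], const int C[N]) {
    for (size_t j = 0; j < N; j++)
        A[j] = B[j] + C[j];
}

/* What an unrolling compiler might effectively produce: four
   independent additions per iteration that can issue in parallel. */
void vector_add_unrolled(int A[N], const int B[N], const int C[N]) {
    for (size_t j = 0; j < N; j += 4) {
        A[j]     = B[j]     + C[j];
        A[j + 1] = B[j + 1] + C[j + 1];
        A[j + 2] = B[j + 2] + C[j + 2];
        A[j + 3] = B[j + 3] + C[j + 3];
    }
}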
Thread Level Parallelism
• Splits code into parallel threads which can execute independently on multiple cores
• In fact, threads may have to synchronize with each other on some occasions to exchange data values
• TLP is the task of the programmer (see the pthreads sketch below):
• Need to design new algorithms
• Re-structure code to expose and manage threads
• Coordinate the threads so that they work together correctly
• Synchronization among threads is a process that ensures multiple threads reach a known point of operation before continuing.
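A minimal pthreads sketch (illustrative, not from the slides; compile with -pthread; the thread count and array size are arbitrary choices): each thread adds its own chunk of the arrays, and pthread_join is the synchronization point where the threads meet:

#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4   /* assumes NTHREADS divides N evenly */

static int A[N], B[N], C[N];

/* Each thread computes one contiguous chunk of the vector sum. */
static void *worker(void *arg) {
    long tid = (long)arg;
    int lo = tid * (N / NTHREADS);
    int hi = lo + (N / NTHREADS);
    for (int j = lo; j < hi; j++)
        A[j] = B[j] + C[j];
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);  /* wait for all threads to finish */
    printf("A[0] = %d\n", A[0]);
    return 0;
}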
Parallel Architecture Classification Scheme
• There are three computer architectural classification schemes based on the multiplicity of instruction and data streams. They are given below:
1) Flynn’s Classification (1966) – SISD, SIMD, MISD, MIMD
2) Feng’s Classification (1972)
3) Handler’s Classification (1977)
Flynn’s Classification – SISD computer
• Single instruction stream, single data stream
• One functional unit: IBM 701, IBM 1620
• Multiple functional units: IBM 360/91, CDC 6600
Flynn’s Classification – SIMD computer
• Single instruction stream, multiple data streams
• Example
• Illiac IV
• STARAN
Flynn’s Classification – MISD computer
• Multiple instruction stream, single data stream
• Here different processing elements run different programs on the same data stream
Flynn’s Classification – MIMD computer
• Multiple instruction stream, multiple data stream
• Example
• UNIVAC 1100
• Burroughs D-825
• Tightly coupled MIMD
• Loosely coupled MIMD
• Multiple SISD
Feng’s Classification
• Feng’s classification was proposed by Feng in the year 1972.
• It is based on the degree of parallelism (DOP): the number of bits that can be processed in unit time.
• The maximum degree of parallelism P of a computer C, denoted P(C), is given by
P(C) = m·n
where m = bit-slice length and n = word length
• A bit slice is a string of bits taken from the same vertical position of all words.
Feng’s Classification
• If m is on the Y-axis and n is on the X-axis, then the area m·n shows the maximum degree of parallelism.
Feng’s Classification
• Accordingly Computers can be classified as
• WSBS (Word Serial Bit Serial)
• WSBP (Word Serial Bit Parallel)
• WPBS (Word Parallel Bit Serial)
• WPBP (Word Parallel Bit Parallel)
Feng’s Classification
• WSBS is a slow process; it was used in first-generation computers. In WSBS, 1 bit is processed at a time.
• WPBP provides the highest degree of parallelism, i.e., the fastest computer.
Examples:
• WSBS – MINIMA (1, 1)
• WSBP – Burroughs 7700 (48, 1)
• WPBS – DAP (1, 4096)
• WPBP – TI-ASC (64, 32)

“TI-ASC has a word length of 64 bits and there are 4 arithmetic pipelines in it, each pipeline consisting of 8 stages. Thus, 8 x 4 = 32 bits are processed together at one time. Therefore, TI-ASC is represented on the graph as (64, 32).”
Handler’s Classification
• In 1977, Wolfgang Handler presented a computer architectural
classification scheme for determining the degree of parallelism and
pipelining built into the computer system hardware.
• In Handler’s classification pipeline processing systems are divided into
three subsystems:
• Processor Control Unit (PCU): Each PCU corresponds to one processor or one
CPU.
• Arithmetic Logic Unit (ALU): ALU is equivalent to the processing element (PE).
• Bit Level Circuit (BLC): BLC corresponds to the combinational logic circuit
required for 1-bit operations in ALU.
Handler’s Classification
• Handler’s classification uses three pairs of integers containing 6 independent entities that describe the computer system:
Computer = <K*K’, D*D’, W*W’>
where K = number of processors (PCUs) within the computer
K’ = number of PCUs that can be pipelined
D = number of ALUs (PEs) under the control of one PCU
D’ = number of PEs that can be pipelined
W = word length of a PE
W’ = number of pipeline stages in all PEs
Handler’s Classification: Example
• Example 1: Consider the Texas Instruments Advanced Scientific Computer (TI ASC), which has one controller that controls 4 arithmetic pipelines, each with a 64-bit word length and 8 pipeline stages.
• From this data, we get K = 1, K’ = 1, D = 4, D’ = 1, W = 64, W’ = 8. So we can represent TI ASC according to Handler’s classification as follows:
TI ASC = <1*1, 4*1, 64*8> or <1, 4, 64*8>
Computer Performance: CPU execution time
• Computer performance is the total amount of work accomplished by a computer system.
• “How well is the computer doing the work it is supposed to do?”
• Computer performance is determined by the execution time.
• Performance is inversely proportional to the execution time:

Performance = 1 / Execution Time

Case: Processor A is faster than processor B
• It implies the execution time of A is less than that of B
• The performance of A is better than that of B
Computer Performance: CPU execution time
Example: Machine A runs a program in 100 sec; Machine B runs the same program in 125 sec.

Speedup n = Performance_A / Performance_B = Execution Time_B / Execution Time_A = 125 / 100 = 1.25

Therefore, processor A is 1.25 times faster than processor B.

The total time a CPU spends computing on a given task is called CPU time or CPU execution time.
Computer Performance: CPU execution time
• A program comprises a number of executed instructions, measured in instructions/program
• The average instruction takes a number of cycles per instruction (CPI) to complete, measured in cycles/instruction
• The CPU has a fixed clock cycle time C = 1/clock rate, measured in seconds/cycle
• CPU execution time is the product of the above three parameters as follows:

CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

T = I x CPI x C
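A quick worked example with made-up numbers: a program executing I = 2 x 10^9 instructions at an average CPI of 1.5 on a 2 GHz clock (C = 0.5 ns/cycle) takes

T = I x CPI x C = (2 x 10^9) x 1.5 x (0.5 x 10^-9 s) = 1.5 s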
Case Study:

“How can we increase the performance in hardware and software while designing a processor?”

To improve performance we can either
a) Decrease the clock cycles per instruction (CPI) by using new hardware
b) Decrease the clock time, i.e., increase the clock rate, by reducing propagation delays or by use of pipelining
c) Decrease the number of required cycles by improving the ISA or the compiler
Parallelism
• Parallelism in computing is the process of breaking down a large task into smaller parts that can be processed simultaneously on multiple processors/CPUs/cores
• In parallelism we have different cores, different CPUs, different processors, and our application uses that network of resources to get the result
• The goal of parallelism is to reduce the overall time it takes to complete a computation
Parallelism (Pipelining)
• Pipelining is a technique of decomposing a sequential process into sub-operation segments, where each segment performs its related task concurrently and transfers its result to the next segment; at the end of the last segment, the final result is obtained.
• Pipelining is the arrangement of hardware elements of the CPU such that its overall performance is increased.
• Simultaneous execution of more than one instruction takes place in pipelined processors.
• In pipelining, multiple instructions are overlapped in execution.
Performance Metrics for Parallel System
• Serial run time: T_s; parallel run time with p processing elements: T_p
• Cost (total processor-time product): T_total = p x T_p
• Total overhead: T_overhead = (p x T_p) - T_s
• Speedup: S = T_s / T_p
Adding n numbers using n processing elements
• Speedup is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements.
• Consider the problem of adding n numbers by using n processing elements. Initially, each processing element is assigned one of the numbers to be added and, at the end of the computation, one of the processing elements stores the sum of all the numbers. Assuming that n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processing elements.
• (figure: the processing elements are labeled from 0 to 15; similarly, the 16 numbers to be added are labeled from 0 to 15)
Performance Metrics for Parallel System
• Speedup
• The speedup for the addition of n numbers is S = T_s / T_p = Ɵ(n) / Ɵ(log n) = Ɵ(n / log n)
• Efficiency
• Efficiency is a measure of the fraction of time for which a processing element is usefully employed
• It is defined as the ratio of speedup to the number of processing elements. In an ideal parallel system, speedup is equal to p and efficiency is equal to 1.
• In practice, speedup is less than p and efficiency is between 0 and 1, depending on the effectiveness with which the processing elements are utilized. We denote efficiency by the symbol E. Mathematically, it is given by E = S / p.
• Therefore the efficiency of the algorithm for adding n numbers on n processing elements is
E = S / p = Ɵ((n / log n) / n) = Ɵ(1 / log n)
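To make this concrete (illustrative numbers, not from the slides): for n = 1024 numbers on p = 1024 processing elements, the tree reduction takes log2(1024) = 10 steps, so S ≈ 1024 / 10 ≈ 102 and E ≈ 102 / 1024 ≈ 0.1; only about a tenth of the total processor-time is useful work, which is why this system is not cost-optimal.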
Performance Metrics for Parallel System
• When this computation is scaled down to p processing elements in the naive way described below, the overall parallel execution time is Ɵ((n/p) log p)
• The cost is Ɵ(n log p), which is asymptotically higher than the Ɵ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.
Can we build granularity in the above example in a cost-optimal fashion?
• Often, using fewer processors improves the performance of parallel systems.
• Using fewer than the maximum allowable number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
• A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled-down processors.
• Since the number of processing elements decreases by a factor of n/p, the computation (and communication) at each processing element increases by a factor of n/p.
Can we build granularity in the above example in a cost-optimal fashion?
• Cost-optimal way: each processing element first adds its n/p numbers locally, then the p partial sums are combined up a binary tree in log p steps (a sketch follows below).
• The parallel runtime of this algorithm now is Ɵ((n/p) + log p)
• The cost is Ɵ(n + p log p)
• As long as n = Ω(p log p), the cost is Ɵ(n), which is the same as the serial runtime. Hence, this parallel system is cost-optimal.
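A minimal C sketch of the cost-optimal scheme (illustrative: the processing elements and the tree combine are simulated sequentially in ordinary C; N and P are arbitrary powers of two chosen so that n = Ω(p log p) holds):

#include <stdio.h>

#define N 1024   /* number of values to add */
#define P 8      /* number of processing elements */

int main(void) {
    static int a[N];
    int partial[P];

    for (int i = 0; i < N; i++) a[i] = 1;  /* sample data: sum should be N */

    /* Phase 1: each PE adds its N/P numbers locally, Theta(n/p) work each */
    for (int pe = 0; pe < P; pe++) {
        partial[pe] = 0;
        for (int i = pe * (N / P); i < (pe + 1) * (N / P); i++)
            partial[pe] += a[i];
    }

    /* Phase 2: combine the P partial sums up a binary tree in log P steps;
       in each step, PE i takes the partial sum of PE i+stride */
    for (int stride = 1; stride < P; stride *= 2)
        for (int i = 0; i + stride < P; i += 2 * stride)
            partial[i] += partial[i + stride];

    printf("sum = %d\n", partial[0]);  /* prints 1024 */
    return 0;
}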
Solve: A 4-stage pipeline has stage delays of 150, 120, 160 and 140 ns. Registers are used between stages and have a delay of 5 ns. Assuming a constant clock rate, find the total time to process 1000 data items on this pipeline.

Hint: Speedup ratio S = (n x t_n) / [(k + n - 1) x t_p]

= (total time required for n tasks without pipelining) / (total time required for n tasks with pipelining)

As the number of tasks increases, n becomes much larger, so k + (n - 1) ≈ n.
Hence the maximum speedup ratio is S = (n x t_n) / (n x t_p) = t_n / t_p

where
k = number of segments (stages)
t_p = clock period of the pipeline, i.e., the delay of the slowest stage plus the register delay
n = number of tasks/instructions
t_n = time required per task without pipelining
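A worked solution under the usual assumption that the clock period is set by the slowest stage plus the register delay:

t_p = max(150, 120, 160, 140) + 5 = 165 ns
Total time = (k + n - 1) x t_p = (4 + 1000 - 1) x 165 ns = 1003 x 165 ns = 165,495 ns ≈ 165.5 µs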
Case Study:

“Can we reduce the size of the instructions?”

Opcode | Address of Operand1 | Address of Operand2 | Destination Address | Address of next instruction

Fig. Instruction set format
Parallel Programming Techniques
• There are two basic approaches in parallelizing the program as under
• Explicit Parallel Programming
• Implicit Parallel Programming
Explicit Parallel Programming
• Here the programmer develops a parallel program by explicitly specifying the parallelism.
• The programmer has to write down the parallel algorithm to solve the given problem.
• The techniques are:
• Use of specific programming languages which support parallel programming
• Use of compiler directives or library function calls
• Use of automatic parallelization tools that analyze the program and find parallelism
Implicit Parallel Programming
• In this technique, the programmer is relieved from parallelizing the
program but it is the compiler which plays an important role.
• Here the compiler is responsible for extracting the parallelism in the
program. There is no burden on the programmer to know about the
parallel programming languages and also the underlying parallel
architecture.
• We shall be considering only the explicit programming approach here.
Parallel Programming Models
Computer
• Sequential computer
• uniprocessor
• Parallel Computer
• Shared Memory (common memory, multiprocessor)
• UMA (uniform memory access)
• NUMA (non uniform memory access)
• Shared local memory
• Hierarchical cluster
• COMA (cache only memory access)
• Unshared Memory (distributed memory, multicomputer, message passing
multicomputer)
• NORMA (no remote memory access)
UMA model for multiprocessor system
(figure: from Kai Hwang, 3e, p. 19)

NUMA model for multiprocessor system
(figure: from Kai Hwang, 3e, p. 21)

COMA model for multiprocessor system
(figure: from Kai Hwang, 3e, p. 22)

Distributed Memory Multicomputer
(figure: from Kai Hwang, 3e, p. 24)

Classification based on memory arrangement

• Parallel architectures can be classified into two major categories in terms of memory arrangement:
• Shared memory
• Distributed memory, or message passing
• This classification constitutes a subdivision of MIMD parallel architectures; the two classes are also known as:
• Shared memory architecture → tightly coupled architecture
• Distributed memory architecture → loosely coupled architecture
• A third choice is somewhere in between: groups of processors share a memory block while the different groups, also referred to as nodes, have distinct memory blocks

Shared Memory Multiprocessor

• A UMA architecture comprises two or more processors with identical characteristics
• The processors:
• share the same main memory and I/O facilities
• are interconnected by networks in some form: bus-based, crossbar, or multiport memories
• Processors perform the same functions under control of an operating system, which provides interaction between processors and their programs at the job, task, file and data element levels

Shared Memory Multiprocessor

• In the case of NUMA architectures, memory is physically distributed but logically shared
• Each processor has part of the shared memory attached, but the memory has a single address space; therefore, any processor can access any memory location directly
• The memory access time of processors differs depending on which region of the main memory is accessed
• A subclass of NUMA systems is cache-coherent NUMA (cc-NUMA), where cache coherence is maintained among the caches of the various processors
• The main advantage of a cc-NUMA system is that it can deliver effective performance at higher levels of parallelism

Message Passing Multicomputer

• In a distributed memory architecture each unit is a complete computer building block including the processor, memory and I/O system
• Units are referred to as nodes, and there is no sharing of memory between them
• Nodes are typically able to store messages in buffers (temporary memory locations where messages wait until they can be sent or received), and perform send/receive operations at the same time as processing
• Hence, communication among the processors is established in the form of I/O operations through message signals and interconnection networks

Message Passing Multicomputer

• Example (see the MPI sketch below):
• If a processor needs data from another processor, it sends a signal to that processor through the interconnection network demanding the required data; the remote processor then responds accordingly
• Notice that the further the physical distance to the remote processor, the longer it will take to access the remote data
• The processing units of a message passing system may be connected in a variety of ways, ranging from architecture-specific interconnection structures to geographically dispersed networks
• The message passing approach is, in principle, scalable to large proportions
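A minimal MPI sketch of this request/response pattern (illustrative; assumes an MPI installation, compiled with mpicc and run with mpirun -np 2; the tags and payload values are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Node 0 asks node 1 for data and waits for the reply. */
        int request = 1;
        MPI_Send(&request, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 0 received %d from node 1\n", value);
    } else if (rank == 1) {
        /* Node 1 services the request by sending back its local data. */
        int request, local_data = 42;
        MPI_Recv(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&local_data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}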
Example: Sum of array elements
Pseudo code for sequential program

#include <stdio.h>
#define MAXSIZE 1000  /* maxsize: arbitrary */

int A[MAXSIZE];
int sum = 0;

int main(void)
{
    for (int i = 0; i < MAXSIZE; i++) {
        sum += A[i];
    }
    printf("%d\n", sum);
    return 0;
}
Example: Sum of array elements
Pseudo code for the parallel algorithm, using shared variables with lock/barrier synchronization

// shared variables
sum = 0;
int A[maxsize];
int lockvar;
int joinvar;

// private variables
// pid = process ID, 0 to nproc-1; nproc = number of processes/threads
int lo, hi, mysum = 0;
lo = pid * maxsize / nproc;
hi = lo + maxsize / nproc;
for (i = lo; i < hi; i++) {
    mysum += A[i];
}
BARRIER(joinvar);   // wait until all threads finish their local sums
LOCK(lockvar);      // critical section: add the local sum to the global sum
sum += mysum;
UNLOCK(lockvar);
Example: Sum of array elements
Pseudo code for the parallel program explained
• Variables declared globally can be shared
• The array, lockvar and joinvar are accessible to all threads
• Each thread computes its local sum in mysum
• Since the values may differ between threads, a critical section is used to add mysum to the global sum in serial fashion
• Allow all the threads to complete the local summation before adding to the global sum
• The variable joinvar is used for barrier synchronization
• A barrier makes a thread wait until all threads have reached the barrier
• Barriers and locks can be implemented using Load, Store and Read-Modify-Write instructions in the ISA
Parallel Programming
• OpenMP (C, C++ & Fortran):
• A directive-based extension of C (and C++/Fortran) commonly used for shared memory systems (a sketch follows this list)
• POSIX threads (pthreads):
• For shared memory based communication we can create threads using the pthreads library
• Library support is required for doing this
• MPI (Message Passing Interface):
• For message passing, the MPI interface is very popular and is used on shared as well as distributed memory systems
• Implementation is based on libraries, functions and macros
• CUDA and OpenCL:
• GPU programming for parallel processing
• Java Parallel Streams:
• Multithreading using Java's parallel stream APIs
• Python:
• Libraries like multiprocessing, Dask, and Ray for parallelism
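A minimal OpenMP sketch of the array-sum example (illustrative; compile with a flag such as gcc -fopenmp): the reduction clause gives each thread a private partial sum and combines them at the end, which replaces the explicit lock/barrier code shown earlier:

#include <omp.h>
#include <stdio.h>

#define MAXSIZE 1000

int main(void) {
    static int A[MAXSIZE];
    int sum = 0;

    for (int i = 0; i < MAXSIZE; i++) A[i] = 1;  /* sample data */

    /* Iterations are divided among threads; each thread accumulates
       into a private copy of sum, which OpenMP combines at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < MAXSIZE; i++)
        sum += A[i];

    printf("sum = %d\n", sum);  /* prints 1000 */
    return 0;
}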
Case Study: deadline 26th December
• What makes Google the best search engine? Explain the role of
parallel processing in enabling Google to handle large-scale web
searches efficiently.
Google’s Algorithm
• The Google algorithm is a program used by Google for ranking websites on its result pages among organic results
• To list a few basic types of Google algorithms, we have
• PageRank Algorithm
• It is the heart of Google search
• Developed by Larry Page and Sergey Brin
• Based on the hyperlink map
• An excellent way to prioritize the results of web keyword searches
• Panda Algorithm
• Launched in Feb 2011
• Its main target was to detect “content farms” and to block them from showing in Google search results
• A content farm or content mill is a company that employs freelance creators or uses automated tools to generate a large amount of web content specifically designed to satisfy search engine algorithms for maximal retrieval, known as SEO (search engine optimization)
Google’s Algorithm
• Penguin Algorithm
• Google Penguin was launched in April 2012
• Fights spam sites and erases them from results
• A search engine update that aims to reduce web spam and improve the quality of search results
• Hummingbird Algorithm
• Launched in Sept 2013
• Was aimed at improving the indexing of information rather than sorting through information
• The name originates from being precise and fast
• It was one of the biggest changes in Google's history
• This algorithm doesn't affect SEO
Google’s Algorithm
• Pigeon Algorithm
• This update was released in 2014
• Was designed to improve ranking parameters based on distance and location
• Aimed to increase the ranking of local listings in a search
• Here you need to check local.google.com to see whether your business is listed already or not, then follow the online instructions and select a verification method
• Google Fred Algorithm
• Released in March 2017
• Designed by Google to target black-hat tactics that are connected to overly aggressive monetization
• Google Fred specifically looks for excessive ads, low-value content and websites that generally offer very little user benefit
Google’s Algorithm
2019 Update: Broad Core Algorithm
• It is all about providing a better user experience
• It is believed that Google makes tweaks to its algorithm at least twice per day, which means 600 to 700 updates a year
• Google core updates are “broad” in the sense that they don't target anything specific; rather, they are designed to improve Google's systems overall
• So pages that drop in ranking are not being penalized; they are being reassessed against other web content that has been published since the last update
Case Study:
• Understanding Parallel Processing in Google:
• Explain the role of parallel processing in enabling Google to handle large-scale web searches efficiently.
• How does dividing tasks into smaller sub-tasks and processing them concurrently improve the speed and reliability of search results?

• Major Advancements:
• Discuss how Google's PageRank algorithm uses parallel processing to rank web pages.
• Explore the MapReduce framework and its significance in processing massive datasets. How does this framework utilize parallel processing to simplify complex computational tasks?

• Role of Machine Learning:
• How has the integration of machine learning enhanced Google's search engine capabilities? Highlight the use of neural networks in understanding user intent and improving search relevance.

• Comparative Analysis:
• Compare Google’s approach to parallel processing with another search engine (e.g., Bing or Yahoo). What unique strategies make Google superior?

• Engineering Insights:
• As an ME engineering student, suggest how principles of parallel processing could be applied to other fields of engineering, such as mechanical systems simulation or optimization.

This case study encourages critical thinking and interdisciplinary application of parallel processing concepts.
Set A
1. Explain the significance of parallel processing in modern computing. How
does parallelism improve system performance, and what are the key
challenges associated with implementing parallel algorithms?
2. Why is parallel programming essential in modern computing? Discuss the
key characteristics of parallel algorithms and explain how they influence
the design and efficiency of parallel systems.
3. Explain the architectural classification of computer systems based on
Flynn's taxonomy. Provide examples for each category and discuss their
relevance in parallel computing.
4. Is it possible to develop a parallelizing compiler that transforms a
sequential program into a multithreaded program? Provide justification
for your answer.
Set B
1. In the example of adding ‘𝑛’ numbers using ‘𝑝’ processing
elements, can granularity be achieved in a cost-optimal manner?
Provide justification for your answer.
2. Differentiate between data and instruction. Compare and contrast
between Von-Neumann and Harvard architecture.
3. "Is it possible to implicitly reduce the size of an instruction? Explain
your reasoning.“
4. What are the different parallel programming models used in parallel
computing? Explain each model with examples and discuss how
they impact the performance and scalability of parallel systems.
