1. Adv CA - slide deck 1
What is Computer Architecture?
[Layered abstraction figure: Application at the top, down through Gates, Circuits, and Devices to Physics at the bottom]
Technology Constraints:
• Restrict what can be done efficiently
• New technologies make new architectures possible
Computers Then…
[Slide images: routers, robots, smart phones, automobiles, supercomputers]
Major Technology Generations
[Graph, from Kurzweil: successive technology generations — electromechanical, relays, vacuum tubes, bipolar, pMOS, nMOS, CMOS]
Sequential Processor Performance
[Graph: single-processor performance over time — rapid growth through the RISC era, followed by the move to multi-processor]
From Hennessy and Patterson Ed. 5. Image Copyright © 2011, Elsevier Inc. All rights reserved.
Course Structure
• Recommended Readings
• In-Lecture Questions
• Problem Sets
– Very useful for exam preparation
– Peer Evaluation
• Midterm
• Final Exam
Course Content: Computer Organization
• Basic Pipelined Processor (~50,000 transistors)
Where Do Operands Come From and Where Do Results Go?
[Build-up figure: four machine models side by side — Stack (operands on a memory stack with a top-of-stack (TOS) register), Accumulator (a single implicit register), Register-Memory, and Register-Register — each drawn as a processor with an ALU connected to memory]
Machine Model Summary
[Figure: the same four machines — Stack (with TOS), Accumulator, Register-Memory, Register-Register — each a processor with an ALU connected to memory]
Code for C = A + B under each model:

Stack    | Accumulator | Register-Memory | Register-Register
---------+-------------+-----------------+-------------------
Push A   | Load A      | Load R1, A      | Load R1, A
Push B   | Add B       | Add R3, R1, B   | Load R2, B
Add      | Store C     | Store R3, C     | Add R3, R1, R2
Pop C    |             |                 | Store R3, C
Real World Instruction Sets
[Table comparing real instruction sets: Arch | Type | # Operands | # Memory operands | Data Size | # Regs | Addr Size | Use]
“TI-ASC has a word length of 64 bits and there are 4 arithmetic pipelines in it, each pipeline consisting of 8 stages. Thus, 8 x 4 = 32 words are processed together at one time. Therefore, TI-ASC is represented on the graph as (64, 32).”
Handler’s Classification
• In 1977, Wolfgang Handler presented a computer-architectural classification scheme for determining the degree of parallelism and pipelining built into the computer system hardware.
• In Handler’s classification, a computer system is described at three subsystem levels:
– Processor Control Unit (PCU): each PCU corresponds to one processor or one CPU.
– Arithmetic Logic Unit (ALU): an ALU is equivalent to a processing element (PE).
– Bit Level Circuit (BLC): a BLC corresponds to the combinational logic circuit required for 1-bit operations in the ALU.
Handler’s Classification
• Handler’s classification uses three pairs of integers (six independent entities) to describe a computer system:
Computer = <K x K', D x D', W x W'>
where K  = number of processors (PCUs) in the computer
      K' = number of PCUs that can be pipelined
      D  = number of ALUs (PEs) under the control of each PCU
      D' = number of PEs that can be pipelined
      W  = word length of a PE
      W' = number of pipeline stages in each PE
Handler’s Classification: Example
• Example 1: Consider the Texas Instruments Advanced Scientific Computer (TI ASC), which has one controller that controls 4 arithmetic pipelines, each with a 64-bit word length and 8 pipeline stages.
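Applying the notation above to these numbers: K = 1 (one control unit), K' = 1 (no pipelining of control units), D = 4 (four arithmetic pipelines under its control), D' = 1, W = 64 (word length), W' = 8 (pipeline stages per ALU), so
TI ASC = <1 x 1, 4 x 1, 64 x 8>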
Computer Performance: CPU Execution Time
• The total time a CPU spends computing on a given task is called CPU time or CPU execution time.
• A program executes a number of instructions, measured in instructions/program (I).
• The average instruction takes a number of cycles per instruction to complete, measured in cycles/instruction (CPI).
• The CPU has a fixed clock cycle time C = 1/(clock rate), measured in seconds/cycle.
• CPU execution time is the product of these three parameters (a numeric sketch follows below):
T = I x CPI x C
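A minimal numeric sketch of this formula; the instruction count, CPI, and clock rate below are made-up values for illustration, not measurements of any real machine:

#include <stdio.h>

int main(void) {
    double I   = 2e9;   /* instructions per program (assumed)             */
    double CPI = 1.5;   /* average clock cycles per instruction (assumed) */
    double f   = 3e9;   /* clock rate in Hz (assumed), so C = 1/f s/cycle */

    double C = 1.0 / f;       /* clock cycle time in seconds/cycle */
    double T = I * CPI * C;   /* T = I x CPI x C                   */
    printf("CPU time = %.3f s\n", T);   /* prints 1.000 s          */
    return 0;
}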
Case Study:
• Total overhead:
T_overhead = p x T_p - T_s
where T_s = serial run time and T_p = parallel run time with p processing elements
• Speedup (a numeric sketch follows below):
S = T_s / T_p
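A small numeric sketch of these two formulas; the values of T_s, T_p, and p are arbitrary illustrative choices:

#include <stdio.h>

int main(void) {
    double Ts = 100.0;   /* serial run time (assumed, arbitrary time units)       */
    double Tp = 30.0;    /* parallel run time on p processing elements (assumed)  */
    int    p  = 4;

    double overhead = p * Tp - Ts;   /* T_overhead = p x T_p - T_s  -> 20.0  */
    double speedup  = Ts / Tp;       /* S = T_s / T_p               -> ~3.33 */
    printf("overhead = %.1f, speedup = %.2f\n", overhead, speedup);
    return 0;
}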
Adding n numbers using n processing elements
• Speedup is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements.
• Consider the problem of adding n numbers using n processing elements. Initially, each processing element is assigned one of the numbers to be added and, at the end of the computation, one of the processing elements stores the sum of all the numbers. Assuming that n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processing elements (a serial sketch of these steps follows below).
• The processing elements are labeled from 0 to 15. Similarly, the 16 numbers
to be added are labeled from 0 to 15.
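A serial sketch of the tree-based summation described above: each pass of the outer loop models one of the log n parallel steps, with array element i standing in for processing element i; the values 0..15 are just the labels from the slide.

#include <stdio.h>

int main(void) {
    int n = 16;
    int a[16];
    for (int i = 0; i < n; i++)
        a[i] = i;                    /* numbers 0..15, one per processing element */

    /* Each pass models one parallel step: PE i absorbs the partial sum  */
    /* held by PE i+step, so the number of active PEs halves every pass. */
    for (int step = 1; step < n; step *= 2)
        for (int i = 0; i + step < n; i += 2 * step)
            a[i] += a[i + step];

    printf("sum = %d\n", a[0]);      /* PE 0 ends up with 0+1+...+15 = 120 */
    return 0;
}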
Adding n numbers using n processing elements
[Figure: binary-tree summation of the 16 numbers across processing elements 0-15]
Performance Metrics for Parallel System
• Speedup
– The speedup for the addition of n numbers is S = T_s / T_p = Θ(n) / Θ(log n) = Θ(n / log n).
• Efficiency
– Efficiency is a measure of the fraction of time for which a processing element is usefully employed.
– It is defined as the ratio of speedup to the number of processing elements. In an ideal parallel system, speedup equals p and efficiency equals 1.
– In practice, speedup is less than p and efficiency lies between 0 and 1, depending on how effectively the processing elements are utilized. We denote efficiency by E; mathematically,
E = S / p (a worked example follows below)
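• Worked example (a rough sketch, counting one addition per time step and ignoring communication cost): for the 16-number case above, T_s = 15 additions and T_p = log2(16) = 4 steps, so S = 15/4 ≈ 3.75 and E = S / p = 3.75/16 ≈ 0.23; asymptotically, E = Θ((n / log n) / n) = Θ(1 / log n).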
Example: Sum of array elements
• Serial code:

#include <stdio.h>
#define MAXSIZE 1024

int A[MAXSIZE];          /* input array (zero-initialized here) */

int main(void) {
    int sum = 0;
    for (int i = 0; i < MAXSIZE; i++)
        sum += A[i];     /* accumulate every element */
    printf("sum = %d\n", sum);
    return 0;
}
Pseudo code for parallel algorithm using Message Passing
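One possible sketch in C with MPI (an assumption; the slide does not prescribe a particular message-passing library): every rank sums an equal-sized chunk of the array and the partial sums are combined at rank 0 with MPI_Reduce. MAXSIZE, the all-ones initialization, and the even division of work are illustrative choices.

#include <stdio.h>
#include <mpi.h>

#define MAXSIZE 1024

int main(int argc, char *argv[]) {
    int rank, size;
    int A[MAXSIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every rank fills its own copy; a real code would read or scatter the data */
    for (int i = 0; i < MAXSIZE; i++)
        A[i] = 1;

    /* each rank sums its own contiguous chunk (assumes size divides MAXSIZE) */
    int chunk = MAXSIZE / size;
    int local_sum = 0;
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        local_sum += A[i];

    /* combine all partial sums at rank 0 */
    int sum = 0;
    MPI_Reduce(&local_sum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}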
Case Study:
• Understanding Parallel Processing in Google:
• Explain the role of parallel processing in enabling Google to handle large-scale web searches efficiently.
• How does dividing tasks into smaller sub-tasks and processing them concurrently improve the speed and
reliability of search results?
• Major Advancements:
• Discuss how Google's PageRank algorithm uses parallel processing to rank web pages.
• Explore the MapReduce framework and its significance in processing massive datasets. How does this
framework utilize parallel processing to simplify complex computational tasks?
• Comparative Analysis:
• Compare Google’s approach to parallel processing with another search engine (e.g., Bing or Yahoo). What
unique strategies make Google superior?
• Engineering Insights:
• As an ME engineering student, suggest how principles of parallel processing could be applied to other fields of engineering, such as mechanical systems simulation or optimization.