Computer Organization
Computer Abstractions and
Technology
Kun-Chih (Jimmy) Chen 陳坤志
kcchen@nycu.edu.tw
Institute of Electronics,
National Yang Ming Chiao Tung University
NYCU EE / IEE
The Computer Revolution
❖ Progress in computer technology
❖ Underpinned by Moore’s Law
❖ Makes novel applications feasible
❖ Computers in automobiles
❖ Cell phones
❖ Human genome project
❖ World Wide Web
❖ Search Engines
❖ Computers are pervasive
P2
Hollerith Tabulating Machine – CTR (1890)
❖ It opened the world’s eyes to the very idea of data processing.
❖ Beginning with the 1890 US census
❖ Reduced the data-collection time from 144.5 hours to 72.5 hours
❖ Reduced the tabulating time from 55.5 hours to 5.5 hours
P3
From CTR to IBM (1896 → 1924)
❖ Herman Hollerith was a founder of the Computing-Tabulating-Recording
Company (CTR)
❖ Its main business was the census
❖ Thomas J. Watson rebranded CTR as International Business Machines
Corporation (IBM)
P4
Turing Machine
❖ The idea of a universal computational device was first described by
Alan Turing in 1937.
❖ He showed that all computation could be performed by a special kind of machine
P5
Von Neumann Model
❖ Around 1944–1945, John von Neumann proposed that
❖ since program and data are logically the same, programs should also be
stored in the memory of a computer.
Von Neumann with the first Institute computer
P6
Four Subsystems
❖ Computers built on the von Neumann model divide the computer
hardware into four subsystems
❖ Memory: Circuit Theory, Electronics, VLSI Design
❖ Arithmetic logic unit (ALU): Logic Design, Computer Architecture, Digital System Design, VLSI Design
❖ Control unit: Logic Design, Computer Architecture, Digital System Design, …
❖ Input/output
P7
ENIAC (1945)
❖ ENIAC (Electronic Numerical Integrator and Computer) was the first
programmable, electronic, general-purpose digital computer
P8
First Transistor (1947)
❖ In celebration of ENIAC’s 50th anniversary, the machine was
reimplemented using CMOS technology in 1995. The room-size
computer could now fit in the palm of your hand!
P9
Google’s first server (1999)
P10
Moore’s Law and Technology Scaling
❖ “… the number of components on an IC doubles every 18–24 months,
with the chip price staying roughly constant …”
– Gordon Moore, 1965
P11
Moore’s Law in Action
❖ Example:
❖ Intel Core i7 microprocessor evolution
➢ Six-core Core i7 (Gulftown), 2010
➢ 32nm technology
➢ 1,170 M transistors
➢ Die size: 240 mm2
➢ Transistor density: 4.875 M/mm2
➢ 10-core Core i7 (Broadwell-E), 2016
➢ 14nm technology
➢ 3,200 M transistors
➢ Die size: 246 mm2
➢ Transistor density: 13.008 M/mm2
Source: Intel
P12
Reference Size
❖ Cell size (diameter): 10–100 µm
❖ Erythrocyte size (diameter): 6–9 µm
❖ Protein size (diameter): 1–20 nm
❖ DNA molecule size (diameter): 2 nm
P13
IC 60 (1958 - 2018)
P14
Classes of Computers
❖ Personal computers
❖ General purpose, variety of software
❖ Subject to cost/performance tradeoff
❖ Server computers
❖ Network based
❖ High capacity, performance, reliability
❖ Range from small servers to building sized
❖ Supercomputers
❖ High-end scientific and engineering calculations
❖ Highest capability but represent a small fraction of the overall computer
market
❖ Embedded computers
❖ Hidden as components of systems
❖ Stringent power/performance/cost constraints
P15
Understanding Performance
❖ Algorithm
❖ Determines number of operations executed
❖ Programming language, compiler, architecture
❖ Determine number of machine instructions executed per operation
❖ Processor and memory system
❖ Determine how fast instructions are executed
❖ I/O system (including OS)
❖ Determines how fast I/O operations are executed
P16
Eight Great Ideas
❖ Design for Moore’s Law
❖ Use abstraction to simplify design
❖ Make the common case fast
❖ Performance via parallelism
❖ Performance via pipelining
❖ Performance via prediction
❖ Hierarchy of memories
❖ Dependability via redundancy
P17
Below Your Program
❖ Application software
❖ Written in high-level language
❖ System software
❖ Compiler: translates HLL code to machine code
❖ Operating System: service code
➢ Handling input/output
➢ Managing memory and storage
➢ Scheduling tasks & sharing resources
❖ Hardware
❖ Processor, memory, I/O controllers
P18
Levels of Program Code
❖ High-level language
❖ Level of abstraction closer to problem
domain
❖ Provides for productivity and portability
❖ Assembly language
❖ Textual representation of instructions
❖ Hardware representation
❖ Binary digits (bits)
❖ Encoded instructions and data
P19
Semiconductor Technology
❖ Silicon: semiconductor
❖ Add materials to transform properties:
❖ Conductors
❖ Insulators
❖ Switch
P20
Technology Trends
❖ Electronics technology
continues to evolve
❖ Increased capacity and
performance
❖ Reduced cost
[Figure: DRAM capacity growth over time]

Year   Technology                    Relative performance/cost
1951   Vacuum tube                   1
1965   Transistor                    35
1975   Integrated circuit (IC)       900
1995   Very large scale IC (VLSI)    2,400,000
2013   Ultra large scale IC          250,000,000,000
P21
VLSI IC Technology
                                  2001    2005    2010    2016
Line width (nm)                   130     80      45      22
Clock (GHz)                       1.7     5.2     11.5    28.8
DRAM cost (microcents/bit)        7.7     1.9     0.34    0.042
MPU cost (microcents/transistor)  97      24      4.31    0.54
Supply voltage (V)                1.2     1.0     0.8     0.6
Wiring levels                     7       9       10      10

(Trend: cost per transistor falls while chip density rises)
P22
Manufacturing ICs
❖ Yield: proportion of working dies per wafer
P23
Intel Core i7 Wafer
❖ 300mm wafer, 280 chips, 32nm technology
❖ Each chip is 20.7 x 10.5 mm
P24
Integrated Circuit Cost
Cost per die = Cost per wafer / (Dies per wafer × Yield)

Dies per wafer ≈ Wafer area / Die area

Yield = 1 / (1 + Defects per area × Die area / 2)²
❖ Nonlinear relation to area and defect rate
❖ Wafer cost and area are fixed
❖ Defect rate determined by manufacturing process
❖ Die area determined by architecture and circuit design
P25
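The cost and yield relations above can be sketched as a small calculator. A minimal Python sketch; the function names and the example numbers in the comments are ours, not from the slides:

```python
# Sketch of the die-cost model (illustrative; names and numbers are ours).

def dies_per_wafer(wafer_area, die_area):
    # First-order approximation: ignores dies lost at the wafer edge.
    return wafer_area / die_area

def die_yield(defects_per_area, die_area):
    # Yield = 1 / (1 + defects_per_area * die_area / 2)^2
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(defects_per_area, die_area)
    return wafer_cost / good_dies
```

Because yield drops nonlinearly with die area, doubling the die area more than doubles the cost per die on a defective process.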
Performance
❖ Purchasing perspective: Given a collection of machines, which has the
❖ best performance? least cost?
❖ best performance/cost?
❖ Design perspective: Faced with design options, which has the
❖ best performance improvement? least cost?
❖ best performance improvement/cost?
❖ Both require
❖ a basis for comparison
❖ a metric for evaluation
❖ Goal: understand cost and
performance implications of
architectural choices
P26
Performance Measures
❖ Typical performance metrics
❖ Response time (latency): time between start and completion
❖ Throughput (bandwidth) -- rate: work done per unit time
❖ Speedup of X relative to Y
❖ Execution timeY / Execution timeX
❖ B is n times faster than A
➢ Means exec_time_A / exec_time_B = rate_B / rate_A = n
❖ Execution time
❖ Wall clock time (or response time, or elapsed time): includes all system
overheads
❖ CPU time: only computation time
P27
Measuring Performance
❖ Time is the measure of computer performance
❖ Elapsed time (Wall-clock time) = program execution + I/O + wait
❖ important to user
❖ Execution time = user time + system time (but OS self measurement
may be inaccurate)
❖ CPU performance = user time on unloaded system
❖ important to architect
P28
Real Performance Measurement
❖ Benchmark suites
❖ Attempts at running programs that are much simpler than a real application
have led to performance pitfalls
❖ Performance is the result of executing a workload on a configuration
❖ Workload = program + input
❖ Configuration = CPU + cache + memory + I/O + OS + compiler
optimizations
❖ compiler optimizations can make a huge difference!
P29
Improve Performance by
❖ Changing the
❖ algorithm (ex. Bubble sorting vs. Quick sorting)
❖ data structures (ex. Structure vs. pointer)
❖ programming language (ex. C vs. Matlab)
❖ compiler
❖ compiler optimization flags
❖ OS parameters (ex. Windows vs. Linux)
❖ Improving locality of memory or I/O accesses
❖ Overlapping I/O (ex. Single direction port vs. bi-direction port)
❖ On multiprocessors, you can improve performance by avoiding cache
coherency problems (e.g., false sharing) and synchronization
problems
P30
CPU Clocking
❖ Operation of digital hardware governed by a constant-rate clock
❖ Clock period: duration of a clock cycle
❖ e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
❖ Clock frequency (rate): cycles per second
❖ e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
[Figure: clock waveform — each clock period consists of data transfer and computation, followed by a state update]
P31
CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
❖ Performance improved by
❖ Reducing number of clock cycles
❖ Increasing clock rate
❖ Hardware designer must often trade off clock rate against cycle count
P32
CPU Time Example
❖ Computer A: 2GHz clock, 10s CPU time
❖ Designing Computer B
❖ Aim for 6s CPU time
❖ Can do faster clock, but causes 1.2 × clock cyclesA
❖ How fast must Computer B clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10⁹

Clock Rate_B = (1.2 × 20×10⁹) / 6s = (24×10⁹) / 6s = 4GHz
P33
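The worked example above can be checked numerically. A minimal Python sketch (variable names are ours):

```python
# Computer A: 2 GHz clock, 10 s CPU time; Computer B targets 6 s CPU time
# but its faster clock costs 1.2x the clock cycles of A.
cpu_time_a = 10.0            # seconds
clock_rate_a = 2e9           # 2 GHz

clock_cycles_a = cpu_time_a * clock_rate_a   # 20e9 cycles
clock_cycles_b = 1.2 * clock_cycles_a        # 24e9 cycles
clock_rate_b = clock_cycles_b / 6.0          # required rate for a 6 s run

print(round(clock_rate_b / 1e9, 6))  # -> 4.0 (GHz)
```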
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction (CPI)

CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
❖ Instruction Count for a program
❖ Determined by program, ISA and compiler
❖ Average cycles per instruction
❖ Determined by CPU hardware
❖ If different instructions have different CPI
➢ Average CPI affected by instruction mix
P34
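The relation above, written as a helper function (the name `cpu_time` and the example numbers are our choices):

```python
# CPU Time = Instruction Count x CPI / Clock Rate
def cpu_time(instruction_count, avg_cpi, clock_rate_hz):
    return instruction_count * avg_cpi / clock_rate_hz

# e.g., 1e9 instructions at CPI 2.0 on a 2 GHz clock take 1 second.
print(cpu_time(1e9, 2.0, 2e9))  # -> 1.0
```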
What is Instruction Count?
❖ How many C operations?
for (i = 0; i < 10; i++)
    a[i] = b[i] * c[i];
❖ How many Assembly instructions?
        sub  $r1, $r2, $r3
Loop:   beq  $r9, $r0, End
        add  $r8, $r8, $r10
        addi $r9, $r9, -1        (loop body runs 10 times => 41 instructions)
        j    Loop
End:
➢ Dynamic Instruction Count
P35
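The dynamic count above follows from simple arithmetic: the `sub` executes once, and the 4-instruction loop body (`beq`, `add`, `addi`, `j`) executes 10 times. A minimal sketch:

```python
# Static count (instructions in the listing) vs. dynamic count (executions).
setup = 1          # sub, executed once
body = 4           # beq, add, addi, j
iterations = 10

dynamic_count = setup + body * iterations
print(dynamic_count)  # -> 41
```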
CPI Example
❖ Computer A: Cycle Time = 250ps, CPI = 2.0
❖ Computer B: Cycle Time = 500ps, CPI = 1.2
❖ Same ISA
❖ Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250ps = I × 500ps          ← A is faster…
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500ps = I × 600ps

CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2   ← …by this much
P36
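The comparison above can be computed directly; `I` stands for the shared instruction count (same ISA), which cancels in the ratio:

```python
I = 1.0                       # any value works; it cancels in the ratio
time_a = I * 2.0 * 250e-12    # CPI_A x cycle time_A (seconds)
time_b = I * 1.2 * 500e-12    # CPI_B x cycle time_B (seconds)

ratio = time_b / time_a
print(round(ratio, 2))  # -> 1.2, so A is faster by 1.2x
```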
CPI in More Detail
❖ If different instruction classes take different numbers of cycles
Clock Cycles = Σ (i = 1..n) CPI_i × Instruction Count_i

❖ Weighted average CPI

CPI = Clock Cycles / Instruction Count
    = Σ (i = 1..n) CPI_i × (Instruction Count_i / Instruction Count)
                            └── relative frequency of instruction class i
P37
CPI Example
❖ Alternative compiled code sequences using instructions in classes A, B,
C
Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  4   1   1

❖ Sequence 1: IC = 5
❖ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
❖ Avg. CPI = 10/5 = 2.0
❖ Sequence 2: IC = 6
❖ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
❖ Avg. CPI = 9/6 = 1.5
P38
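The weighted-average CPI computation above, as a short Python sketch (the dictionary layout is our choice):

```python
cpi = {"A": 1, "B": 2, "C": 3}   # cycles per instruction for each class

def avg_cpi(instruction_counts):
    # Weighted average: total cycles divided by total instruction count.
    cycles = sum(cpi[c] * n for c, n in instruction_counts.items())
    return cycles / sum(instruction_counts.values())

seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}
print(avg_cpi(seq1), avg_cpi(seq2))  # -> 2.0 1.5
```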
Performance Summary
CPU execution time for program (designer’s view)
= Clock Cycles for program x Clock Cycle Time
❖ Substituting for clock cycles:
CPU execution time for program (user’s view)
= Instruction Count x CPI x Clock Cycle Time
CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
P39
Performance Summary
The BIG Picture
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
❖ Performance depends on
❖ Algorithm: affects IC, possibly CPI
❖ Programming language: affects IC, CPI
❖ Compiler: affects IC, CPI
❖ Instruction set architecture: affects IC, CPI, Tc (i.e., clock cycle time)
P40
MIPS as a Performance Measure
❖ MIPS: Million Instructions Per Second
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)
P41
Pitfall: MIPS as a Performance Metric
❖ MIPS: Millions of Instructions Per Second
❖ Doesn’t account for
➢ Differences in ISAs between computers
➢ Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)
➢ CPI varies between programs on a given CPU
P42
Example
❖ Consider a machine with three instruction classes, A, B, and C,
whose CPIs are 1, 2, and 3, respectively. Suppose we measure the
code for the same program from two different compilers and obtain
the following data:
Code from     Instruction counts (in billions = 10⁹)
              A    B    C
Compiler 1    5    1    1
Compiler 2    10   1    1
❖ Assume that the machine’s clock rate is 4GHz. Which code
sequence will execute faster according to MIPS rate? According
to execution time?
P43
Example
Code from     Instruction counts (in billions)
              A    B    C
Compiler 1    5    1    1
Compiler 2    10   1    1
CPI           1    2    3
❖ CPU clock cycles1 = (5×1 + 1×2 + 1×3)×10⁹ = 10×10⁹
❖ CPU clock cycles2 = (10×1 + 1×2 + 1×3)×10⁹ = 15×10⁹
❖ Execution time1 = (10×10⁹) / (4×10⁹) = 2.5 sec
❖ Execution time2 = (15×10⁹) / (4×10⁹) = 3.75 sec
❖ MIPS = IC / (Execution time × 10⁶)
❖ MIPS1 = [(5+1+1)×10⁹] / (2.5×10⁶) = 2800
❖ MIPS2 = [(10+1+1)×10⁹] / (3.75×10⁶) = 3200
P44
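The two metrics above can be computed side by side to expose the pitfall: compiler 2 posts the higher MIPS rate yet takes longer to run. A minimal sketch (variable names are ours):

```python
clock_rate = 4e9
cpi = {"A": 1, "B": 2, "C": 3}
code = {
    "Compiler 1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "Compiler 2": {"A": 10e9, "B": 1e9, "C": 1e9},
}

results = {}
for name, counts in code.items():
    cycles = sum(cpi[c] * n for c, n in counts.items())
    exec_time = cycles / clock_rate                    # seconds
    mips = sum(counts.values()) / (exec_time * 1e6)    # million instr./sec
    results[name] = (exec_time, mips)

print(results)
# -> Compiler 1: 2.5 s at 2800 MIPS; Compiler 2: 3.75 s at 3200 MIPS
```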
Amdahl’s Law: A method to quantify
performance speedup
Gene Amdahl
Computer Pioneer
P45
Amdahl’s Law
❖ Amdahl’s law gives us a quick way to find the speedup from some
enhancement, which depends on two factors:
❖ Fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement
➢ How much fraction of the computation time can be enhanced?
➢ The fraction is always less than 1
❖ How much faster the task would run if the enhanced mode were used for
the entire program
➢ Speedup = (execution time without enhancement) / (execution time with enhancement)
= T_without / T_with
P46
Example
Suppose we are considering an enhancement to a web server. The
enhanced CPU is 10 times faster on computation but the same speed on
I/O. Suppose also that 60% of the time is waiting on I/O
Answer:
Fraction_enhanced = 0.4
Speedup_enhanced = 10

Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56
P47
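Amdahl's Law as a one-line function (the name `overall_speedup` is ours):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    # 1 / ((1 - f) + f / s): the unenhanced fraction limits the speedup.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(overall_speedup(0.4, 10), 4))  # -> 1.5625
```

Note that even an infinitely fast enhancement (s → ∞) would only give 1/0.6 ≈ 1.67 here, because 60% of the time is still spent on I/O.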
Power Trends
❖ In CMOS IC technology
Power = Capacitive load × Voltage² × Frequency

[Trend chart: over three decades, supply voltage fell from 5V to 1V while clock frequency grew ×1000, so power grew only ×40]
P48
Reducing Power
❖ Suppose a new CPU has
❖ 85% of capacitive load of old CPU
❖ 15% voltage and 15% frequency reduction
P_new / P_old = [C_old×0.85 × (V_old×0.85)² × F_old×0.85] / [C_old × V_old² × F_old]
             = 0.85⁴ ≈ 0.52
❖ The power wall
❖ We can’t reduce voltage further
❖ We can’t remove more heat
❖ How else can we improve performance?
P49
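The 0.52 result above follows because the capacitive load, the (squared) voltage, and the frequency are each scaled by 0.85, giving 0.85⁴ overall:

```python
scale = 0.85
# C scaled by 0.85, V^2 contributes 0.85^2, F scaled by 0.85 -> 0.85^4 total.
power_ratio = scale * scale**2 * scale

print(round(power_ratio, 2))  # -> 0.52
```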
Power and Energy
❖ Problem: Get power in, get power out
❖ Thermal Design Power (TDP)
❖ Characterizes sustained power consumption
❖ Lower than peak power, higher than average power consumption
❖ Used as target for power supply and cooling system
➢ Modern processors provide two kinds of thermal management
❖ Clock rate can be reduced dynamically to limit power consumption
❖ Energy per task is often a better measurement
P50
Case Study: Profile of Intel i7-8850H
P51
Power
❖ Intel 80386 consumed ~2 W
❖ 3.3 GHz Intel Core i7 consumes 130 W
❖ Heat must be dissipated from a 1.5 × 1.5 cm chip
❖ This is the limit of what can be cooled by air
P52
Power vs. Thermal Issue
❖ Distributing the power, removing the heat, and preventing hot spots
have become increasingly difficult challenges.
❖ Techniques for reducing power
❖ Do nothing well
❖ Dynamic Voltage-Frequency Scaling
❖ Low power state for DRAM, disks
❖ Overclocking, turning off cores
Leakage Current Increase 5% per 10°C
P53
Case Study: Intel® Turbo Boost Technology 2.0
❖ Features
❖ Under the Thermal Design Power (TDP) limit, the maximum frequency state
depends on the workload and:
➢ Number of active cores
➢ Estimated current consumption
➢ Estimated power consumption
➢ Processor temperature
❖ Commercial Products
❖ Intel Core i7-2xxx series
❖ Intel Core i5-2xxx series
❖ Intel Xeon E3-12xx series
Near zero power for inactive core
P54
Single Processor Performance
[Figure: single-processor performance over time — rapid growth in the RISC era, followed by the move to multi-processors]
P55
Multiprocessors
❖ Multicore microprocessors
❖ More than one processor per chip
❖ Requires explicitly parallel programming
❖ Compare with instruction level parallelism
➢ Hardware executes multiple instructions at once
➢ Hidden from the programmer
❖ Hard to do
➢ Programming for performance
➢ Load balancing
➢ Optimizing communication and synchronization
P56
Why Do I Want to Know These?
[Figure: the computer architect works at the intersection of technology, interfaces (API, ISA, I/O channels, links), machine organization (IR, registers), applications, historical background and trends, and measurement & evaluation]
P57
In Fact, Architecture Design Is an Iterative
Process
-- searching the space of possible designs
-- at all levels of computer systems
[Figure: design loop — new concepts are created, their cost & performance are estimated, and the ideas are sorted into good and mediocre ones; historical background and an understanding of trends help the selection process]
P58