4/17/2025
VIETNAM NATIONAL UNIVERSITY HANOI (VNU)
VNU INFORMATION TECHNOLOGY INSTITUTE
Computer Architecture
Lecture 1: Computer abstraction & performance
Duy-Hieu Bui, PhD
AIoT Laboratory
Email: hieubd@vnu.edu.vn
https://duyhieubui.github.io
Content adapted from “Computer Organization and Design RISC-V
Edition: The Hardware Software Interface, Second Edition” by David A.
Patterson, John L. Hennessy, published by Morgan Kaufmann. © 2020
Elsevier Inc. All rights reserved.
4/17/2025 VNU-ITI/CICA 2
The Computer Revolution
• Progress in computer technology
– Underpinned by domain-specific accelerators
• Makes novel applications feasible
– Computers in automobiles
– Cell phones
– Human genome project
– World Wide Web
– Search Engines
• Computers are pervasive
Classes of Computers
• Personal computers
– General purpose, variety of software
– Subject to cost/performance tradeoff
• Server computers
– Network based
– High capacity, performance, reliability
– Range from small servers to building sized
• Supercomputers
– Type of server
– High-end scientific and engineering calculations
– Highest capability but represent a small fraction of the overall
computer market
• Embedded computers
– Hidden as components of systems
– Stringent power/performance/cost constraints
The PostPC Era
• Personal Mobile Device (PMD)
– Battery operated
– Connects to the Internet
– Hundreds of dollars
– Smart phones, tablets, electronic glasses
• Cloud computing
– Warehouse Scale Computers (WSC)
– Software as a Service (SaaS)
– Portion of software run on a PMD and a portion run in the Cloud
– Amazon and Google
Understanding Performance
• Algorithm
– Determines number of operations executed
• Programming language, compiler, architecture
– Determine number of machine instructions executed
per operation
• Processor and memory system
– Determine how fast instructions are executed
• I/O system (including OS)
– Determines how fast I/O operations are executed
Seven Great Ideas
• Use abstraction to simplify design
• Make the common case fast
• Performance via parallelism
• Performance via pipelining
• Performance via prediction
• Hierarchy of memories
• Dependability via redundancy
Below Your Program
• Application software
– Written in high-level language
• System software
– Compiler: translates HLL code to machine code
– Operating System: service code
• Handling input/output
• Managing memory and storage
• Scheduling tasks & sharing resources
• Hardware
– Processor, memory, I/O controllers
Levels of Program Code
• High-level language
– Level of abstraction closer to problem domain
– Provides for productivity and portability
• Assembly language
– Textual representation of instructions
• Hardware representation
– Binary digits (bits)
– Encoded instructions and data
Components of a Computer
• Same components for all kinds of computer
– Desktop, server, embedded
• Input/output includes
– User-interface devices
• Display, keyboard, mouse
– Storage devices
• Hard disk, CD/DVD, flash
– Network adapters
• For communicating with other computers
Touchscreen
• PostPC device
• Supersedes keyboard and mouse
• Resistive and capacitive types
– Most tablets, smart phones use capacitive
– Capacitive allows multiple touches simultaneously
Through the Looking Glass
• LCD screen: picture elements (pixels)
– Mirrors content of frame buffer memory
Opening the Box
Inside the Processor (CPU)
• Datapath: performs operations on data
• Control: sequences datapath, memory, ...
• Cache memory
– Small fast SRAM memory for immediate access to data
Inside the Processor
• A12 processor
Abstractions
• Abstraction helps us deal with complexity
– Hide lower-level detail
• Instruction set architecture (ISA)
– The hardware/software interface
• Application binary interface
– The ISA plus system software interface
• Implementation
– The details underlying an interface
A Safe Place for Data
• Volatile main memory
– Loses instructions and data when powered off
• Non-volatile secondary memory
– Magnetic disk
– Flash memory
– Optical disk (CDROM, DVD)
Networks
• Communication, resource sharing, nonlocal access
• Local area network (LAN): Ethernet
• Wide area network (WAN): the Internet
• Wireless network: WiFi, Bluetooth
Technology Trends
• Electronics technology continues to evolve
– Increased capacity and performance
– Reduced cost

[Figure: DRAM capacity]

Year   Technology                   Relative performance/cost
1951   Vacuum tube                  1
1965   Transistor                   35
1975   Integrated circuit (IC)      900
1995   Very large scale IC (VLSI)   2,400,000
2013   Ultra large scale IC         250,000,000,000
Semiconductor Technology
• Silicon: semiconductor
• Add materials to transform properties:
– Conductors
– Insulators
– Switch
Manufacturing ICs
• Yield: proportion of working dies per wafer
Intel® Core 10th Gen
• 300mm wafer, 506 chips, 10nm technology
• Each chip is 11.4 x 10.7 mm
Integrated Circuit Cost
• Nonlinear relation to area and defect rate
– Wafer cost and area are fixed
– Defect rate determined by manufacturing process
– Die area determined by architecture and circuit design
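$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$
$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$
$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area}/2\right)^{2}}$$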
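The three cost formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this note, the wafer is assumed to be a full circle with no edge loss, and any wafer cost or defect density passed in is a hypothetical value, not data from the lecture.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Approximate dies per wafer as wafer area / die area (ignores edge loss)."""
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2.0) ** 2
    return wafer_area_mm2 / die_area_mm2


def die_yield(defects_per_mm2: float, die_area_mm2: float) -> float:
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_mm2 * die_area_mm2 / 2.0) ** 2


def cost_per_die(wafer_cost: float, wafer_diameter_mm: float,
                 die_area_mm2: float, defects_per_mm2: float) -> float:
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    good_dies = (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                 * die_yield(defects_per_mm2, die_area_mm2))
    return wafer_cost / good_dies


# Die size from the Intel Core 10th Gen slide: 11.4 mm x 10.7 mm on a 300 mm wafer.
die_area = 11.4 * 10.7
estimated_dies = dies_per_wafer(300.0, die_area)
```

For the 300 mm wafer and 11.4 × 10.7 mm die quoted earlier, the area ratio gives roughly 579 dies, more than the 506 chips on the real wafer, because the approximation ignores partial dies at the wafer edge.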
Defining Performance
• Which airplane has the best performance?
Response Time and Throughput
• Response time
– How long it takes to do a task
• Throughput
– Total work done per unit time
• e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
– Replacing the processor with a faster version?
– Adding more processors?
• We’ll focus on response time for now…
Relative Performance
• Define Performance = 1/Execution Time
• “X is n times faster than Y”
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
• Example: time taken to run a program
– 10s on A, 15s on B
– Execution Time_B / Execution Time_A = 15s / 10s = 1.5
– So A is 1.5 times faster than B
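As a quick check of the definition above, the ratio can be computed directly. The function name is an assumption made for this sketch.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is defined as 1 / execution time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    return time_y / time_x


# Example from the slide: 10 s on A, 15 s on B.
n = relative_performance(time_x=10.0, time_y=15.0)
print(n)  # 1.5
```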
Measuring Execution Time
• Elapsed time
– Total response time, including all aspects
• Processing, I/O, OS overhead, idle time
– Determines system performance
• CPU time
– Time spent processing a given job
• Discounts I/O time, other jobs’ shares
– Comprises user CPU time and system CPU time
– Different programs are affected differently by CPU and system
performance
CPU Clocking
• Operation of digital hardware governed by a
constant-rate clock
[Figure: clock waveform marking the clock period; each clock cycle covers data transfer and computation, followed by a state update]
• Clock period: duration of a clock cycle
– e.g., 250ps = 0.25ns = 250×10^-12 s
• Clock frequency (rate): cycles per second
– e.g., 4.0GHz = 4000MHz = 4.0×10^9 Hz
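Clock period and clock rate are reciprocals of each other, which is why the 250 ps and 4.0 GHz examples describe the same clock. A minimal sketch, with function names chosen only for illustration:

```python
def clock_rate_hz(period_s: float) -> float:
    """Clock rate (cycles per second) is the reciprocal of the clock period."""
    return 1.0 / period_s


def clock_period_s(rate_hz: float) -> float:
    """Clock period (seconds per cycle) is the reciprocal of the clock rate."""
    return 1.0 / rate_hz


rate = clock_rate_hz(250e-12)   # 250 ps period
period = clock_period_s(4.0e9)  # 4.0 GHz rate

print(f"{rate / 1e9:.1f} GHz")   # 4.0 GHz
print(f"{period * 1e12:.0f} ps")  # 250 ps
```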
CPU Time
• Performance improved by
– Reducing number of clock cycles
– Increasing clock rate
– Hardware designer must often trade off clock rate against cycle
count
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$
CPU Time Example
• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
– Aim for 6s CPU time
– Can do faster clock, but causes 1.2 × clock cycles
• How fast must Computer B clock be?
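$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$$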
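The same calculation can be replayed step by step in Python. The variable names are illustrative; the numbers are the ones given for Computers A and B.

```python
# Computer A: 2 GHz clock, 10 s CPU time.
clock_rate_a = 2e9   # Hz
cpu_time_a = 10.0    # seconds

# Clock cycles = CPU time * clock rate.
clock_cycles_a = cpu_time_a * clock_rate_a   # about 20e9 cycles

# Computer B needs 1.2x as many cycles and must finish in 6 s.
clock_cycles_b = 1.2 * clock_cycles_a        # about 24e9 cycles
target_cpu_time_b = 6.0                      # seconds

# Clock rate = clock cycles / CPU time.
clock_rate_b = clock_cycles_b / target_cpu_time_b

print(f"{clock_rate_b / 1e9:.1f} GHz")  # 4.0 GHz
```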
Instruction Count and CPI
• Instruction Count for a program
– Determined by program, ISA and compiler
• Average cycles per instruction
– Determined by CPU hardware
– If different instructions have different CPI
• Average CPI affected by instruction mix
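$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$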
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
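$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
• A is faster…
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$
• …by this much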
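Because both computers implement the same ISA, the instruction count I is the same for both and cancels in the ratio. The sketch below uses a hypothetical instruction count to make that visible; the function name is also an assumption of this note.

```python
def cpu_time_from_cpi(instruction_count: float, cpi: float,
                      cycle_time_s: float) -> float:
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical instruction count; any positive value gives the same ratio.
instruction_count = 1_000_000

time_a = cpu_time_from_cpi(instruction_count, cpi=2.0, cycle_time_s=250e-12)
time_b = cpu_time_from_cpi(instruction_count, cpi=1.2, cycle_time_s=500e-12)

ratio = time_b / time_a
print(f"{ratio:.2f}")  # 1.20, so A is 1.2 times faster than B
```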
CPI in More Detail
• If different instruction classes take different numbers of
cycles
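$$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$$
• Weighted average CPI
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$
– The fraction Instruction Count_i / Instruction Count is the relative frequency of class i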
CPI Example
• Alternative compiled code sequences using
instructions in classes A, B, C
Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1

• Sequence 1: IC = 5
– Clock Cycles = 2×1 + 1×2 + 2×3 = 10
– Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
– Clock Cycles = 4×1 + 1×2 + 1×3 = 9
– Avg. CPI = 9/6 = 1.5
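The two sequences can be checked with a short script that applies the weighted-CPI sum from the previous slide. The dictionary layout and function names are choices made for this sketch.

```python
# CPI for each instruction class, from the table.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}


def clock_cycles(counts: dict) -> int:
    """Sum of CPI_i * instruction count_i over all classes."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Clock cycles divided by total instruction count."""
    return clock_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

print(clock_cycles(sequence_1), average_cpi(sequence_1))  # 10 2.0
print(clock_cycles(sequence_2), average_cpi(sequence_2))  # 9 1.5
```

Sequence 2 executes more instructions but takes fewer clock cycles, so instruction count alone does not predict which sequence is faster.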
Performance Summary
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
Power Trends
• In CMOS IC technology
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency}$$
– Trend annotations: power ×30, voltage 5V → 1V, frequency ×1000
Reducing Power
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction
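$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^{2} \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^{2} \times F_{\text{old}}} = 0.85^{4} = 0.52$$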
• The power wall
– We can’t reduce voltage further
– We can’t remove more heat
• How else can we improve performance?
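The power ratio for the new CPU follows directly from Power = Capacitive load × Voltage² × Frequency. The sketch below works only with scale factors relative to the old CPU; the function name is an assumption of this note.

```python
def power_ratio(cap_scale: float, volt_scale: float, freq_scale: float) -> float:
    """P_new / P_old when each term of C * V^2 * F is scaled by the given factor."""
    return cap_scale * volt_scale ** 2 * freq_scale


# 85% of the capacitive load, 15% lower voltage, 15% lower frequency.
ratio = power_ratio(cap_scale=0.85, volt_scale=0.85, freq_scale=0.85)
print(f"{ratio:.2f}")  # 0.52
```

Voltage enters squared, which is why lowering the supply voltage was the most effective lever until it could not be reduced further.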
Uniprocessor Performance
Constrained by power, instruction-level parallelism,
memory latency
Multiprocessors
• Multicore microprocessors
– More than one processor per chip
• Requires explicitly parallel programming
– Compare with instruction level parallelism
• Hardware executes multiple instructions at once
• Hidden from the programmer
– Hard to do
• Programming for performance
• Load balancing
• Optimizing communication and synchronization
SPEC CPU Benchmark
• Programs used to measure performance
– Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
– Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
– Elapsed time to execute a selection of programs
• Negligible I/O, so focuses on CPU performance
– Normalize relative to reference machine
– Summarize as geometric mean of performance ratios
• CINT2006 (integer) and CFP2006 (floating-point)
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
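The geometric mean used to summarize the SPEC ratios can be sketched as follows. The ratios in the example are hypothetical and are not SPEC results; the mean is computed through logarithms, which is equivalent to taking the n-th root of the product.

```python
import math


def geometric_mean(ratios: list) -> float:
    """n-th root of the product of n positive ratios, computed via logs."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical execution time ratios (reference time / measured time).
example_ratios = [2.0, 8.0]
summary = geometric_mean(example_ratios)  # square root of 2 * 8
```

Python's standard library also offers `statistics.geometric_mean`; the explicit version is shown here only to mirror the formula.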
SPECspeed 2017 Integer benchmarks on a
1.8 GHz Intel Xeon E5-2650L
SPEC Power Benchmark
• Power consumption of server at different workload levels
– Performance: ssj_ops/sec
– Power: Watts (Joules/sec)
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Bigg/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$
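The overall metric divides the sum of throughput over all workload levels by the sum of power over the same levels. A minimal sketch, with made-up measurements standing in for real benchmark data:

```python
def overall_ssj_ops_per_watt(ssj_ops: list, watts: list) -> float:
    """Sum of ssj_ops over all load levels divided by the sum of power."""
    if len(ssj_ops) != len(watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(watts)


# Hypothetical readings for three load levels (not measured data).
example_ops = [100.0, 50.0, 0.0]
example_watts = [10.0, 8.0, 2.0]
overall = overall_ssj_ops_per_watt(example_ops, example_watts)  # 150 / 20
```

Note that this is a ratio of sums, not an average of the per-level ratios, so the idle level (zero operations but nonzero power) still lowers the overall score.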
i =0 i =0
SPECpower_ssj2008 for Xeon E5-2650L
Pitfall: Amdahl’s Law
• Improving an aspect of a computer and
expecting a proportional improvement in overall
performance
$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$
• Example: multiply accounts for 80s/100s
– How much improvement in multiply performance to get 5× overall?
$$20 = \frac{80}{n} + 20$$
– Can’t be done!
• Corollary: make the common case fast
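The multiply example can be explored numerically. The function names below are illustrative; the 80 s and 20 s figures are the ones from the example.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Amdahl's Law: T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / factor + t_unaffected


def speedup_limit(t_affected: float, t_unaffected: float) -> float:
    """Overall speedup approached as the improvement factor grows without bound."""
    return (t_affected + t_unaffected) / t_unaffected


# Multiply accounts for 80 s of a 100 s run.
print(improved_time(80.0, 20.0, factor=4.0))  # 40.0, a 2.5x overall speedup
print(speedup_limit(80.0, 20.0))              # 5.0, approached but never reached
```

A 5× overall speedup would need a total of 20 s, which is exactly the unaffected time, so no finite improvement factor for multiply can reach it.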
Fallacy: Low Power at Idle
• Look back at the SPEC power benchmark
– At 100% load: 258W
– At 50% load: 170W (66%)
– At 10% load: 121W (47%)
• Google data center
– Mostly operates at 10% – 50% load
– At 100% load less than 1% of the time
• Consider designing processors to make power
proportional to load
Pitfall: MIPS as a Performance Metric
• MIPS: Millions of Instructions Per Second
– Doesn’t account for
• Differences in ISAs between computers
• Differences in complexity between instructions
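$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$$
• CPI varies between programs on a given CPU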
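The simplified form MIPS = Clock rate / (CPI × 10^6) makes the pitfall easy to see: the same CPU reports different MIPS ratings for programs with different CPI. The clock rate and CPI values below are hypothetical.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# One hypothetical 4 GHz CPU running two programs with different average CPI.
mips_program_1 = mips(4e9, cpi=1.0)
mips_program_2 = mips(4e9, cpi=2.0)

print(mips_program_1)  # 4000.0
print(mips_program_2)  # 2000.0
```

The rating halves even though the hardware is unchanged, and it says nothing about how many instructions each program needs, which is why execution time remains the better measure.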
Concluding Remarks
• Cost/performance is improving
– Due to underlying technology development
• Hierarchical layers of abstraction
– In both hardware and software
• Instruction set architecture
– The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
– Use parallelism to improve performance