6.823 Computer System Architecture
Lecturers: Krste Asanovic, Srini Devadas
Spring 2002
What's 6.823 About?
What's under the hood of your desktop?
How powerful is the engine?
How much gas does it burn?
How do you build one?
From the Beta (6.004) RISC processor to the
Pentium-4
We won't emphasize:
VLSI implementations
Parallel processing
Quantitative evaluation
Course Information
You must sign up for the course through the web
~7 Homeworks (30%)
Midterm (30%)
Final (40%)
Tutorials: one session/week, 1 hour/session
All students must help grade homeworks once during the
semester; signup sheets distributed during class
Course Information (contd.)
Textbook: Hennessy and Patterson, Computer
Architecture: A Quantitative Approach
(strongly recommended)
Prerequisite material: Patterson and Hennessy's
Hardware/Software Interface book
Some material in lectures will be from other sources
Tutorials: No new material introduced, reinforcement
of lecture content using examples
Problem Set 0
Goal is to help you judge for yourself whether you
have prerequisites for this class
We assume that you understand digital logic, a
simple 5-stage pipeline, and simple caches
For this problem set only, work by yourself, not
in groups.
Due at start of class.
History of Calculation and Computer
Architecture
The Importance of Calculation
Plato in Philebus (On Pleasure) 4th century BC
On the very first level of knowledge are to be found number
and calculation
Pythagorean mysticism
Do numbers rule the world?
Scientists needed to escape the hardships of
human calculation
Chinese abacus (suan pan) 13th century AD: First instrument
to provide a simple means to carry out all the operations of
arithmetic
Abacus still in use today!
PLATO
If you cannot calculate, you cannot speculate on future pleasure and your life
will not be that of a human, but that of an oyster or jellyfish.
LUIGI MENABREA
How the prospects of long and arid calculations have demoralized great
thinkers, who seek only time to meditate but see themselves swamped by the
sheer mass of arithmetic to be done by an inadequate system! And yet, it is
the path of laborious analysis that leads to truth, but they cannot follow that
path without employing the guide of number, without which there is no
way to lift the veil which covers the mysteries of nature. (1884)
CHINESE ABACUS 13th Century AD Several rods, five lower balls on
each rod, 2 upper balls, divided by a slat
JAPANESE ABACUS soroban, more refined, reduced to 1 upper ball, and 4
lower balls
Calculating Machines
Wilhelm Schickard's Calculating Clock (1623) was the
earliest calculating machine in history!
multiplication and division required several interventions by the
operator
Operated on the principle of Napier's bones
The possibility of mechanizing arithmetic was
demonstrated by Blaise Pascal with his machine, the
Pascaline (1642)
Series of toothed gears, each numbered 0 to 9
First calculating machine to be commercialized!
Arithmometer (1822), keyboard (1850), printer (1885),
Electro-mechanical Arithmometer (1913)
On 20th September 1623, Schickard wrote as follows to his friend Kepler: "The
calculations which you do by hand, I have recently attempted to achieve
mechanically. I have constructed a machine which, immediately and
automatically, calculates with given numbers, which adds, subtracts,
multiplies and divides. You will cry out with joy when you see how it carries
forward tens and hundreds, or deducts them in subtractions."
Kepler would surely have appreciated such an invention to create his tables of
the movements of the planets, but unfortunately, Schickard's one and only
copy of his own machine was destroyed by fire on 22 February 1624.
Napier's bones or rods (1617): ten wooden rods of square cross-section.
Each of the four sides of each rod corresponded to one of the digits from 0 to
9, and marked down its length, in nine divisions, were the multiples of the
corresponding digit. A kind of multiplication table, from which you read off the
product horizontally when the rods were placed side by side.
The Pascaline was not very reliable. When one wheel completed a revolution, the
next wheel would advance a step. The automatic carrying mechanism tended
to jam when several wheels were simultaneously at 9, necessitating several
simultaneous carries (999 to 1000).
Thomas (director of a French insurance company) built the Arithmometer, the first
machine to be commercialized on a large scale.
Programmability?
Problems with all these machines lay with their
limited capability to carry out a linked sequence
of calculations
Needed to transcribe and enter all intermediate results!
Vaucanson's programmable androids (1738)
Vaucanson (1749) constructed the first automatic
loom
Accepted commands by means of a perforated metal cylinder
Jacquard (1804) perfected the programmable
loom
Activated by a sequence of punched-cards!
Basile Bouchon (1725) invented a loom that accepted commands by means
of a punched tape
Vaucanson in 1738 developed the Digesting Duck, an artificial automaton for
Louis XV.
He stretches out his neck to go and take the grain from the hand, he
swallows it, digests it, and excretes it, once digested through the normal
channels; all the gestures of a duck swallowing rapidly, speeding up the
movement in his throat to pass the food into his stomach, are copied from
nature.
Vaucanson turned his loom into a programmable but cyclical automaton, one
in which the commands were inscribed by means of perforations on a
hydraulically-driven drum, and were regularly repeated.
Jacquard combined the use of a moving drum equipped with a sequence of
punched cards and the concept of a swinging arm that lifted hooks.
Charles Babbage 1791-1871
Lucasian Professor of Mathematics
Cambridge University, 1827-1839
Difference Engine
1823
Application?
Mathematical Tables - Astronomy
Nautical Tables - Navy
Background
Any continuous function can be approximated
by a polynomial --Weierstrass
Technology
mechanical - gears, Jacquard's loom,
simple calculators
The difference engines were the first digital calculators with an automatic
capability of effecting chained sequences of operations, following a program
set up in advance in a control mechanism.
Difference Engine
A machine to compute mathematical tables
Weierstrass:
- Any continuous function can be approximated
by a polynomial
- Any polynomial can be computed from
difference tables
f(n) = n^2 + n + 41
d1(n) = f(n) - f(n-1) = 2n
d2(n) = d1(n) - d1(n-1) = 2
f(n) = f(n-1) + d1(n) = f(n-1) + (d1(n-1) + 2)

n    d2(n)   d1(n)   f(n)
0                    41
1    2       2       43
2    2       4       47
3    2       6       53
...

⇒ all you need is an adder!
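The same computation is easy to check in software. Below is a minimal sketch
(mine, not from the lecture) that tabulates f(n) = n^2 + n + 41 the way a
difference engine would, using only additions:

# Tabulate f(n) by repeated addition: d2 is the constant second
# difference, d1 the running first difference.
def tabulate(n_max):
    f, d1, d2 = 41, 0, 2          # f(0) = 41; d1(n) = 2n; d2(n) = 2
    table = [f]
    for n in range(1, n_max + 1):
        d1 += d2                  # d1(n) = d1(n-1) + 2
        f += d1                   # f(n) = f(n-1) + d1(n)
        table.append(f)
    return table

assert tabulate(3) == [41, 43, 47, 53]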
Difference Engine
1823 - Babbage's paper is published
1834 - The paper is read by the Scheutz brothers
in Sweden
1842 - Babbage gives up the idea of building it;
(he is on to the Analytical Engine!)
1855 - Scheutz displays his machine at the
Paris World's Fair
- Can compute any 6th-degree polynomial
- Speed: 33 to 44 32-digit numbers per minute!
Now the machine is at the Smithsonian.
Babbage was funded by the British Association for the Advancement of
Science, but failed to build the machine.
He was unable to moderate his ambitions, and gradually lost interest in the
original difference engine.
The Scheutz brothers' machine was the first working calculator in history that
printed out its results.
Analytic Engine
1833 - Babbage's paper is published
conceived during a hiatus in the development
of the difference engine
Inspiration: Jacquard's loom
The set of cards with fixed punched holes
dictated the pattern of weave ⇒ the program
The same set of cards could be used with
different colored threads ⇒ numbers
1871 - Babbage dies - the machine remains unrealized.
It is not clear if the analytic engine could be
built even today using only mechanical
technology
However, near the end of his life he became depressed: "If I could live just a
few more years, the Analytical Engine would exist and its example would spread
across the entire planet." Then he added, even more pathetically, "If any
man, who is not rebuffed by my example, one day produces a machine
containing within it all of the principles of mathematical analysis, then I have
no fear for my memory, for he alone will be able to appreciate fully the nature
of my efforts and the value of the results I have obtained."
Analytic Engine
The first conception of a general purpose computer
1. The store in which all variables to be operated upon,
as well as intermediate results are placed.
2. The mill into which the quantities about to be
operated upon are always brought.
The program: a sequence of cards, each specifying
Operation | variable1 | variable2 | variable3
An operation in the mill required feeding two punched
cards and producing a new punched card for the store.
An operation to alter the sequence was also provided!
One of the most striking features of Babbage's Analytical Engine is the way
conditional operations were to be handled. He proposed that a lever move only
if the result of the calculation was negative, and that it should be used to
advance or roll back the cards on the Jacquard mechanism to any specified
extent.
The first programmer
Ada Byron aka Lady Lovelace
Babbage's ideas had a lot of influence later, primarily
because of
Luigi Menabrea, who published notes of Babbage's
lectures in Italy
Lady Lovelace, who translated Menabrea's notes
into English and thoroughly expanded them.
"... the Analytical Engine weaves algebraic patterns ..."
Ada's tutor was Babbage himself!
In the early 20th century the focus shifted to analog
computers, but ...
Countess Lovelace (1815-1852) was Lord Byron's only daughter. It is hard to
imagine a greater contrast than that between the poet and his daughter, who
applied herself to the exact and arduous study of calculating machines.
She devised a certain number of programs with the idea of one day
introducing them to the machine of her friend and master: "We may say that
the Analytical Engine will weave algebraic patterns, just as Jacquard looms
weave flowers and leaves."
Harvard Mark I
Built in 1944, in the IBM laboratories at Endicott by
Howard Aiken Professor of Physics at Harvard
Essentially mechanical but had some electro-
magnetically controlled relays and gears
Weighed 5 tons and had 750,000 components
A synchronizing clock that beat every 0.015 seconds
Performance:
0.3 seconds for addition
6 seconds for multiplication
1 minute for a sine calculation
Broke down once a week!
Over 500 miles of electrical wiring, and 3 million solder joints!
72 registers of 23 bits each
At best took a few minutes to repair, averaged 20 minutes, sometimes
several hours.
During the final months of World War II it was used exclusively by the US
Navy to solve problems in ballistics.
Decommissioned in 1959.
Linear Equation Solver
John Atanasoff, Iowa State University
1930s: Atanasoff and Clifford Berry built the
Linear Equation Solver. It had 300 tubes!
Application:
Linear and Integral differential equations
Background:
Vannevar Bush's Differential Analyzer
--- an analog computer
Technology:
Tubes and Electromechanical relays
Atanasoff decided that the correct mode of
computation was by electronic digital means.
The physical and logical structures of the machine were fairly rudimentary
and it was never an analytical calculator in the true sense of the term (in that
it was not programmable). Furthermore, it never worked properly.
Electronic Numerical Integrator
and Computer (ENIAC)
Inspired by Atanasoff and Berry, Eckert and
Mauchly designed and built ENIAC (1943-45)
The first, completely electronic, operational,
general-purpose analytical calculator!
30 tons, 72 square meters, 200KW
Performance
Read in 120 cards per minute
Addition took 200 μs, Division 6 ms
1000 times faster than Mark I
Not very reliable!
Designers were obliged to install a ventilation shaft and a cooling system in
the room. Machine had 18,000 vacuum tubes.
The slogan was: "The ENIAC is able to calculate the trajectory of a large-caliber naval shell in less time than the shell takes to reach its target!"
To change a program, it was necessary to change the instructions, the
connector panels, and the positions of the switches all at the same time.
The inventors of ENIAC themselves admitted that, after taking into account
human error, the machine only got the correct result 20 times out of 100!
Electronic Discrete Variable
Automatic Computer (EDVAC)
ENIAC's programming system was external
Sequences of instructions were executed independently of the
results of the calculation
Human intervention required to take instructions out of order
Eckert, Mauchly, John von Neumann and others
designed EDVAC (1944) to solve this problem
Solution was the stored program computer
"First Draft of a Report on the EDVAC" was published in
1945, but it bore only von Neumann's signature!
In 1973 the court of Minneapolis attributed the
honor of inventing the computer to John
Atanasoff
Von Neumann (1903-1957) was born in Budapest. He produced great quantities
of work, from set theory to quantum physics. Celebrated for his theory of
games and its application to economics.
The report was the formal specification of EDVAC. Before that, documentation
on ENIAC or EDVAC was non-existent. Von Neumann devised a
mathematical logical notation which expressed the fundamental ideas of the
fetch-decode-execute loop.
Five major components: Arithmetic unit, control unit, the memory, input
devices and output devices.
Stored Program Computer
Program = A sequence of instructions
How to control instruction sequencing?
manual control: calculators
automatic control:
  external (paper tape): Harvard Mark I, 1944; Zuse's Z1, WW2
  internal:
    plug board: ENIAC 1946
    read-only memory: ENIAC 1948
    read-write memory: EDVAC 1947 (concept)
The same storage can be used to store
program and data: EDSAC, 1950 (Maurice Wilkes)
EDSAC: Electronic Delay Storage Automatic Calculator, translation into
machine language of commands input in symbolic form; automatic loading of
the program into core memory, etc.
The First True Computers
Eckert and Mauchly founded their own company
and built the BINAC 1947-49
Two processors that checked each other for reliability
Whirlwind I of MIT was another von Neumann
computer built between 1946-51 by Jay Forrester
Had a magnetic-core memory and a programming language
Used by the US Air Defense
First commercial American computer was
UNIVAC-I designed and built by Eckert and
Mauchly
Used for opinion polls in 1952 Presidential elections
BINAC was the first electronic computer built in the United States.
There were several von Neumann computer efforts, including SEAC.
UNIVAC-I: its CBS television polls accurately predicted Eisenhower's victory.
Dominant Problem: Reliability
Mean time between failures (MTBF)
MIT's Whirlwind, with an MTBF of 20 min., was perhaps
the most reliable machine!
Reasons for unreliability:
1. Vacuum Tubes
2. Storage medium
acoustic delay lines
mercury delay lines
Williams tubes
Selectrons
⇒ CORE memory (J. Forrester, 1951)
Commercial Activity -- 1948-52
IBM's SSEC
Selective Sequence Electronic Calculator
Vacuum tubes in the control unit, and electromagnetic relays everywhere else
150-word store; stored-program machine
Instructions, constants, and tables of data were
read from paper tapes.
66 Tape reading stations!
Serious inconsistencies of logic.
Exhibited by IBM at the start of 1948 in the shop windows of a busy New York
avenue.
Fascinated the public, who came in their thousands to look at the lights on the
calculator blinking away.
Was a near-computer: there was a lack of synchronization in the
calculation due to its hybrid nature.
IBM Computers
IBM 701 -- 30 machines were sold in 1953-54
IBM 650 -- a cheaper, drum based machine,
more than 120 were sold in 1954
and there were orders for 750 more!
Users stopped building their own machines.
Why was IBM late getting into computer
technology?
They were making too much money!
Even without computers, IBM's
revenues were doubling every
4 to 5 years in the '40s and '50s.
Defense Calculator was another name for the IBM 701.
Commissioned during the Korean War.
Software Developments
up to 1955
Libraries of numerical routines
Floating point operations
Transcendental functions
Matrix manipulation, equation solvers, . . .
1955-60
High level Languages - Fortran 1956
Operating Systems; Assemblers, Loaders, Linkers, Compilers
Accounting programs to keep track of
usage and charges
Machines required experienced operators
Most users could not be expected to understand
these programs, much less write them
Machines had to be sold with a lot of resident
software
Factors that Influence
Computer Architecture
Technology
Applications
Computer Architecture
Software
Compatibility
Software played almost no role in defining an
architecture before the mid-fifties.
special-purpose versus general-purpose machines
Compatibility
Essential for portability and competition
Its importance increases with the market size
but it is also the most regressive force
Instruction Set Architecture (ISA) compatibility
The same assembly program can run on any
upward compatible model
then IBM 360/370 ...
now Intel x86
System and application software developers expect
more than ISA compatibility
[Diagram: software layers - applications above the operating system above
proc + mem + I/O, with APIs at each boundary; e.g., Java?, Wintel]
Microprocessor Economics
Designing a state-of-the-art microprocessor
requires a huge design team
Pentium
~300 engineers
PentiumPro ~ 500 engineers
Huge investments in fabrication lines
need to sell 2 to 4 million units to be profitable
Continuous improvements are needed to improve
yields and clock speed
price drops to one tenth in 2-3 years
Fast new processors also require new peripheral
chips (memory controller, I/O)
$$$
Cost of launching a new ISA is prohibitive and
the advantage is not so clear!
View of Computer Architecture

Language / Compiler / System software designer:
  needs mechanisms to support important abstractions;
  determines compilation strategy and new language abstractions.

Architect / Hardware designer:
  decomposes each mechanism into essential micro-mechanisms and
  determines its feasibility and cost-effectiveness;
  proposes mechanisms and features for performance.

The architect's main concerns are cost-performance,
performance, and efficiency in supporting a broad
class of software systems.
Influence of Technology and
Software on Instruction Sets:
Up to the dawn of IBM 360
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Importance of Technology
"New technologies not only provide greater
speed, size and reliability at lower cost, but,
more importantly, these dictate the kinds of
structures that can be considered and thus
come to shape our whole view of what a
computer is."
- Bell & Newell
Technology is the dominant
factor in computer design
Technology (transistors; integrated circuits; VLSI (initially); laser disks, CDs) ⇒ Computers
Technology (core memories; magnetic tapes; disks) ⇒ Computers
Technology (ROMs, RAMs; VLSI; packaging; low power) ⇒ Computers
But Software...
As people write programs and use computers,
our understanding of programming and program
behavior improves.
This has a profound though slower impact on
computer architecture.
Modern architects cannot avoid paying attention
to software and compilation issues.
[Diagram: Technology and Software together shape Computers]
The Earliest Instruction Sets
Single Accumulator - A carry-over from the calculators.
LOAD  x        AC ← M[x]
STORE x        M[x] ← (AC)
ADD   x        AC ← (AC) + M[x]
SUB   x        AC ← (AC) - M[x]
MUL   x        (involved a quotient register)
DIV   x        (involved a quotient register)
SHIFT LEFT     AC ← 2 × (AC)
SHIFT RIGHT
JUMP  x        PC ← x
JGE   x        if (AC) ≥ 0 then PC ← x
LOAD ADR  x    AC ← Extract address field(M[x])
STORE ADR x
Typically less than 2 dozen instructions!
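To make these semantics concrete, here is a hypothetical sketch (not from the
lecture) of an interpreter for this single-accumulator ISA; memory is a dict
from symbolic addresses to values, and a program is a list of
(opcode, operand) pairs:

# Minimal single-accumulator machine. Only the instructions used in
# the examples that follow are modeled; LOAD ADR / STORE ADR and the
# quotient register are omitted.
def run(program, mem):
    ac, pc = 0, 0                       # accumulator and program counter
    while True:
        op, x = program[pc]
        pc += 1
        if   op == "LOAD":  ac = mem[x]
        elif op == "STORE": mem[x] = ac
        elif op == "ADD":   ac = ac + mem[x]
        elif op == "SUB":   ac = ac - mem[x]
        elif op == "JUMP":  pc = x
        elif op == "JGE":   pc = x if ac >= 0 else pc
        elif op == "HLT":   return mem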
Programming a Single
Accumulator Machine

Ci ← Ai + Bi, 1 ≤ i ≤ n

LOOP  LOAD   N
      JGE    DONE
      ADD    ONE
      STORE  N
F1    LOAD   A
F2    ADD    B
F3    STORE  C
      JUMP   LOOP
DONE  HLT

N:    -n
ONE:  1

How to modify the
addresses A, B and C?
Self-Modifying Code

Ci ← Ai + Bi, 1 ≤ i ≤ n

LOOP  LOAD      N
      JGE       DONE
      ADD       ONE
      STORE     N
F1    LOAD      A
F2    ADD       B
F3    STORE     C
      LOAD ADR  F1      ; modify the program
      ADD       ONE     ; for the next iteration:
      STORE ADR F1      ; the address fields of
      LOAD ADR  F2      ; F1, F2, F3 are each
      ADD       ONE     ; incremented by one
      STORE ADR F2
      LOAD ADR  F3
      ADD       ONE
      STORE ADR F3
      JUMP      LOOP
DONE  HLT

Each iteration involves:
                      total   book-keeping
instruction fetches    17        14
operand fetches        10         8
stores                  5         4
Processor State
The information held in the processor at the end of an
instruction to provide the processing context for the next
instruction. e.g. Program Counter, Accumulator, . . .
Programmer visible state of the processor (and memory)
plays a central role in computer organization:
Software can only manipulate programmer-visible
state and can only rely on programmer-visible state
Hardware must never let software observe state
changes other than those defined in the programmer's
manual (e.g., interrupts not visible to running program)
The programmer's model of the machine is a contract
between hardware and software.
Accumulator Machines
Can describe any possible computation using
single accumulator instruction set
Hardware is very simple
Why did different instruction sets evolve?
Computers in mid 50s
Hardware was expensive
Programmers view of the machine was inseparable
from the actual hardware implementation
Example: IBM 650 - a drum machine with 44 instructions
1. 60 1234 1009
Load the contents of location 1234 into the
distribution; put it also into the upper accumulator;
set lower accumulator to zero; and then go to location
1009 for the next instruction.
Good programmers optimized the placement of
instructions on the drum to reduce latency
2. Branch on distribution digit equal to 8
Computers in mid 50s (cont.)
Stores were small (1000 words) and 10 to 50 times
slower than the processor
1. Instruction execution time was totally dominated
by the memory reference time.
More memory references per instruction
⇒ longer execution time per instruction
2. The ability to design complex control circuits to
execute an instruction was the central concern
as opposed to the speed of decoding or ALU.
3. No resident system software because there was no
place to keep it!
[Diagram: Processor ↔ Memory - the memory is the bottleneck!]
Some early solutions:
fast local storage in the processor, i.e., 8-16 registers
as opposed to one accumulator
indexing capability to reduce book-keeping
instructions
complex instructions to reduce instruction fetches
compact instructions, i.e., implicit address bits for
operands, to reduce fetches
Index Registers
Tom Kilburn, Manchester University, mid 50s
One or more specialized registers to simplify address
calculation
Modify existing instructions
LOAD  x, IX    AC ← M[x + (IX)]
ADD   x, IX    AC ← (AC) + M[x + (IX)]
Add new instructions to manipulate index registers
JZi   x, IX    if (IX) = 0 then PC ← x
               else IX ← (IX) + 1
LOADi x, IX    IX ← M[x] (truncated to fit IX)
Index registers have accumulator-like characteristics
Using Index Registers

Ci ← Ai + Bi, 1 ≤ i ≤ n

      LOADi  -n, IX
LOOP  JZi    DONE, IX    # jump or increment IX
      LOAD   A, IX
      ADD    B, IX
      STORE  C, IX
      JUMP   LOOP
DONE  HALT

(A, B, C here denote the addresses of the last elements, LASTA etc.,
since IX runs from -n up to 0.)

Program does not modify itself
Efficiency has improved dramatically (ops / iter):
                    with index regs   without index regs
instruction fetch       5 (2)             17 (14)
operand fetch           2                 10 (8)
store                   1                  5 (4)
Costs: Instructions are 1 to 2 bits longer
Index registers with ALU-like circuitry
Complex control
Indexing vs. Index Registers
LOAD x, IX
Suppose instead of registers, memory locations
are used to implement index registers.
Arithmetic operations on index registers can be
performed by bringing the contents to the accumulator.
Most book-keeping instructions will be avoided,
but each instruction will implicitly cause several
fetches and stores
⇒ complex control circuitry
⇒ additional memory traffic
Operations on Index Registers

To increment index register by k:
AC ← (IX)          new instruction
AC ← (AC) + k
IX ← (AC)          new instruction
Also, the AC must be saved and restored.

It may be better to increment IX directly:
INCi k, IX         IX ← (IX) + k

More instructions to manipulate index registers:
STOREi x, IX       M[x] ← (IX) (extended to fit a word)
...

IX begins to look like an accumulator
⇒ several index registers, several accumulators
⇒ General-Purpose Registers
Support for Subroutine Calls

[Diagram: a main program calls subroutine F twice (call F(a1,...),
call F(b1,...)); the arguments a1, a2 and b1, b2 follow each call
site, and F returns past them after each call.]

A special subroutine jump instruction:
M:  JSR F    means  F ← M + 1 and
                    jump to F+1
Indirect Addressing and
Subroutine Calls

Caller:                Subroutine:
M    JSR  F            F    (holds the return address)
     arg               F+1  ...
     result            S1   LOAD (F)    ; inc F by 1: fetch arg
M+3  ...               S2   STORE (F)   ; inc F by 1: store result
                       S3   JUMP (F)

6 events: execute M, then F+1, S1, S2, S3, then M+3.

Indirect addressing: LOAD (x) means AC ← M[M[x]]

Indirect addressing almost eliminates the need to write
self-modifying code (location F still needs to be modified)
⇒ problems with recursive procedure calls
Recursive Procedure Calls
and Reentrant Codes
Indirect Addressing through a register
LOAD
R1, (R2)
Load register R1 with the contents of the
word whose address is contained in register R2
[Diagram: registers PC and SP point into memory, which holds pure code,
data, and a stack.]
Evolution of Addressing Modes
1. Single accumulator, absolute address
   LOAD  x
2. Single accumulator, index registers
   LOAD  x, IX
3. Indirection
   LOAD  (x)
4. Multiple accumulators, index registers, indirection
   LOAD  R, IX, x
   or
   LOAD  R, IX, (x)    the meaning?
                       R ← M[M[x] + IX]
                       or R ← M[M[x + IX]]
5. Indirect through registers
   LOAD  RI, (RJ)
6. The works
   LOAD  RI, RJ, (RK)    RJ = index, RK = base address
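As an illustration (assumptions mine), the address calculations in the list
above can be written out explicitly; mem is main memory, IX an index
register, and RJ/RK register contents:

# Effective-address calculation for each addressing mode above.
def ea_absolute(x):                  return x            # 1. LOAD x
def ea_indexed(x, IX):               return x + IX       # 2. LOAD x, IX
def ea_indirect(mem, x):             return mem[x]       # 3. LOAD (x)
def ea_indirect_indexed(mem, x, IX):
    return mem[x] + IX               # 4. one reading: M[M[x] + IX]
def ea_indexed_indirect(mem, x, IX):
    return mem[x + IX]               # 4. the other:   M[M[x + IX]]
def ea_reg_indirect(RJ):             return RJ           # 5. LOAD RI, (RJ)
def ea_based_indexed(RJ, RK):        return RK + RJ      # 6. RJ index, RK base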
Variety of Instruction Formats
Two-address formats: the destination is the same as
one of the operand sources
(Reg Reg) to Reg     RI ← RI + RJ
(Reg Mem) to Reg     RI ← RI + M[x]
...
x could be specified directly or via a register;
effective address calculation for x could include
indexing, indirection, ...

Three-address formats: one destination and up to
two operand sources per instruction
(Reg x Reg) to Reg   RI ← RJ + RK
(Reg x Mem) to Reg   RI ← RJ + M[x]
...
Many different formats are possible!
Data Formats and
Memory Addresses
Data formats:
Bytes, Half words, words and double words
Some issues
Byte addressing
Big Endian: bytes numbered 0, 1, 2, 3 from the most-significant end
vs. Little Endian: bytes numbered 3, 2, 1, 0 (byte 0 least significant)
Word alignment
Suppose the memory is organized in 32-bit words.
Can a word address begin only at 0, 4, 8, ...?
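A quick way to see the two byte orders (an illustrative example, not from the
slides) is Python's struct module:

import struct
word = 0x01020304
print(struct.pack(">I", word).hex())   # big endian:    01020304
print(struct.pack("<I", word).hex())   # little endian: 04030201

# Word alignment: if 32-bit words may start only at addresses 0, 4, 8, ...
def word_aligned(addr):
    return addr % 4 == 0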
Some Problems
Should all addressing modes be provided for
every operand?
regular vs. irregular instruction formats
Separate instructions to manipulate
Accumulators
Index registers
Base registers
A large number of instructions
Instructions contained implicit memory references; several contained more than one
⇒ very complex control
Compatibility Problem at IBM
By early 60s, IBM had 4 incompatible lines of computers!
701 → 7094
650 → 7074
702 → 7080
1401 → 7010
Each system had its own:
Instruction set
I/O system and secondary storage
  (magnetic tapes, drums and disks)
Assemblers, compilers, libraries, ...
Market niche
  (business, scientific, real time, ...)
⇒ IBM 360
IBM 360: Design Premises
Amdahl, Blaauw and Brooks, 1964
The design must lend itself to growth and
successor machines
General method for connecting I/O devices
Total performance - answers per month rather than
bits per microsecond ⇒ programming aids
Machine must be capable of supervising itself
without manual intervention
Built-in hardware fault checking and locating aids
to reduce down time
Simple to assemble systems with redundant I/O
devices, memories etc. for fault tolerance
Some problems required floating point words larger
than 36 bits
IBM 360:
A General-Purpose Register Machine
Processor State
16 General-Purpose 32-bit Registers
- may be used as index and base registers
- Register 0 has some special properties
4 Floating Point 64-bit Registers
A Program Status Word (PSW)
PC, Condition codes, Control flags
A 32-bit machine with 24-bit addresses
No instruction contains a 24-bit address!
Data Formats
8-bit bytes, 16-bit half-words, 32-bit words,
64-bit double-words
IBM 360: Implementations

                  Model 30             ...   Model 70
Storage           8K - 64 KB                 256K - 512 KB
Datapath          8-bit                      64-bit
Circuit Delay     30 nsec/level              5 nsec/level
Local Store       Main Store                 Transistor Registers
Control Store     Read only, 1 μsec          Conventional circuits
IBM 360 instruction set architecture completely hid the
underlying technological differences between various
models.
With minor modifications it survives today
IBM S/390 z900 Microprocessor
64-bit virtual addressing
first 64-bit S/390 design (original S/360 was 24-bit, and S/370 was
31-bit extension)
1.1 GHz clock rate (announced ISSCC 2001)
0.18 μm CMOS, 7 layers of copper wiring
770MHz systems shipped in 2000
Single-issue 7-stage CISC pipeline
Redundant datapaths
every instruction performed in two parallel datapaths and results
compared
256KB L1 I-cache, 256KB L1 D-cache on-chip
20 CPUs + 32MB L2 cache per Multi-Chip Module
Water cooled to 10°C junction temperature
What makes a good
instruction set?
Ideally, provides simple software interface yet allows
simple, fast, efficient hardware implementations
but across 25+ year time frame
Example of difficulties:
Current machines have register files with more storage
than entire main memory of early machines!
On-chip test circuitry of current machines has hundreds of
times more transistors than entire early computers!
Full Employment for Architects
Good news: Ideal instruction set changes continually
Technology allows larger CPUs over time
Technology constraints change (e.g., power is now major constraint)
Compiler technology improves over time (e.g., register allocation)
Programming style varies over time (assembly coding, HLL, object-oriented, ...)
Applications and application demands vary over time (e.g.,
multimedia processing, a recent change in workload)
Bad news: Software compatibility imposes huge
damping coefficient on instruction set innovation
Software investment dwarfs hardware investment
Innovate at the microarchitecture level, below the instruction set level; this is
what most computer architects do
New instruction set can only be justified by new large
market and technological advantage
Network processors
Multimedia processors
Complex Instruction Set
Evolution in the Sixties:
Stack and GPR Architectures
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
The Sixties
Hardware costs started dropping
memories beyond 32K words seemed likely
separate I/O processors
large register files
Systems software development became essential
Operating Systems
I/O facilities
Separation of Programming Model from implementation
became essential
family of computers
The Burroughs B5000:
An ALGOL Machine, Robert Barton, 1960
Machine implementation can be completely hidden if the
programmer is provided only a high-level language
interface.
Stack machine organization because stacks are
convenient for:
1. expression evaluation;
2. subroutine calls, recursion, nested interrupts;
3. accessing variables in block-structured languages.
B6700, a later model, had many more innovative features
tagged data
virtual memory
multiple processors and memories
A Stack Machine

A stack machine has a stack as a part of the processor
state. Typical operations: push, pop, +, *, ...
[Diagram: processor containing a stack, connected to the main store.]
Instructions like + implicitly
specify the top 2 elements
of the stack as operands.

push b  ⇒ stack: b, a
push c  ⇒ stack: c, b, a
pop     ⇒ stack: b, a
Evaluation of Expressions

(a + b * c) / (a + d * c - e)

[Expression tree: / at the root, dividing a + (b*c) by a + (d*c) - e.]

Reverse Polish: a b c * + a d c * + e - /

push a, push b, push c, multiply
⇒ evaluation stack: a, b*c
Evaluation of Expressions (cont.)

(a + b * c) / (a + d * c - e)

Reverse Polish: a b c * + a d c * + e - /

add
⇒ evaluation stack: a + b*c
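The whole evaluation is easy to mimic in software. Here is an illustrative
sketch (mine) of a Reverse Polish evaluator with an explicit stack, applied
to the string above; the variable values are made up:

# Evaluate a Reverse Polish token string with an explicit stack.
def eval_rpn(tokens, env):
    stack = []
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
           "*": lambda x, y: x * y, "/": lambda x, y: x / y}
    for t in tokens:
        if t in ops:
            y = stack.pop()            # top of stack = right operand
            x = stack.pop()
            stack.append(ops[t](x, y))
        else:
            stack.append(env[t])       # push a variable's value
    return stack.pop()

env = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
assert eval_rpn("a b c * + a d c * + e - /".split(), env) == \
       (1 + 2*3) / (1 + 4*3 - 5)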
Hardware Organization
of the Stack
Stack is part of the processor state
stack must be bounded and small
≈ the number of registers, not the size of
main memory
Conceptually stack is unbounded
a part of the stack is included in the processor
state; the rest is kept in the main memory
Stack Size and
Memory References

a b c * + a d c * + e - /

Program   Stack (size = 2)        Memory Refs
push a    a                       a
push b    a, b                    b
push c    b, c                    c, ss(a)
*         a, b*c                  sf(a)
+         a+b*c
push a    a+b*c, a                a
push d    a, d                    d, ss(a+b*c)
push c    d, c                    c, ss(a)
*         a, d*c                  sf(a)
+         a+b*c, a+d*c            sf(a+b*c)
push e    a+d*c, e                e, ss(a+b*c)
-         a+b*c, a+d*c-e          sf(a+b*c)
/         (a+b*c)/(a+d*c-e)

(ss = stack store, sf = stack fetch)
4 stores, 4 fetches (implicit)
Stack Operations and
Implicit Memory References
When just the top 2 elements of the stack are kept in
registers and the rest is kept in the memory:
Each push operation ⇒ 1 memory reference
Each pop operation  ⇒ 1 memory reference
No good!
Better performance from keeping the top N elements in
registers and by making memory references only when
register stack overflows or underflows.
Issue - when to Load/Unload registers ?
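A sketch of that policy (assumptions mine: one element spills or refills at a
time, from the bottom of the register stack) might look like:

# Keep at most n_regs top-of-stack elements in registers; spill the
# bottom register to memory on overflow, refill on underflow.
class RegisterStack:
    def __init__(self, n_regs):
        self.regs = []                 # fast on-chip part (top of stack)
        self.n_regs = n_regs
        self.mem = []                  # in-memory part of the stack
    def push(self, v):
        if len(self.regs) == self.n_regs:
            self.mem.append(self.regs.pop(0))   # overflow: 1 memory store
        self.regs.append(v)
    def pop(self):
        if not self.regs:
            self.regs.append(self.mem.pop())    # underflow: 1 memory fetch
        return self.regs.pop()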
Stack Size and
Expression Evaluation

a b c * + a d c * + e - /

Program   Registers in use
push a    R0
push b    R0 R1
push c    R0 R1 R2
*         R0 R1
+         R0
push a    R0 R1
push d    R0 R1 R2
push c    R0 R1 R2 R3
*         R0 R1 R2
+         R0 R1
push e    R0 R1 R2
-         R0 R1
/         R0

a and c are loaded twice
⇒ not the best use of registers!
Register Usage in a GPR Machine

(a + b * c) / (a + d * c - e)

Load  R0, a
Load  R1, c
Load  R2, b
Mul   R2, R1      ; R2 ← b * c
Add   R2, R0      ; R2 ← a + b * c
Load  R3, d
Mul   R3, R1      ; R3 ← d * c
Add   R3, R0      ; R3 ← a + d * c
Load  R0, e
Sub   R3, R0      ; R3 ← a + d * c - e
Div   R2, R3      ; R2 ← (a + b * c) / (a + d * c - e)

More control over register usage, since registers can
be named explicitly:
Load  Ri, m
Load  Ri, (Rj)
Load  Ri, (Rj)(Rk)
- eliminates unnecessary Loads and Stores
- fewer registers needed
but instructions may be longer!
B5000 Procedure Calls
Storage for procedure calls also follows a stack
discipline
However, there is a need to access variables
beyond the current stack frame
- lexical addressing <ll, d>
- display registers to speed up
  accesses to stack frames

[Diagram: nested procedures P, Q, R; the stack holds frames for
P, Q, R, Q, R, ... with static and dynamic links, and display
registers point to the current frames at lexical levels ll = 1, 2, 3.
Automatic loading of display registers?]
Stack Machine Features

[Diagram: stack (pointed to by SP), data area (pointed to by DP,
holding a, b, c, ...), and code area (pointed to by PC, holding
push a, push b, push c, *, +, push e, /), plus the machinery to
carry out +, -, etc.]

In addition to push, pop, + etc., the instruction
set must provide the capability to:
  refer to any element in the data area
  jump to any instruction in the code area
  move any element in the stack frame to the top
Stack versus GPR Organization
Amdahl, Blaauw and Brooks, 1964
1. The performance advantage of push down stack
organization is derived from the presence of fast
registers and not the way they are used.
2. Surfacing of data in the stack which is profitable is
only approximately 50%, because of constants and common
sub-expressions.
3. Advantage of instruction density because of implicit
addresses is equaled if short addresses to specify
registers are allowed.
4. Management of finite depth stack causes complexity.
5. Recursive subroutine advantage can be realized only
with the help of an independent stack for addressing.
6. Fitting variable length fields into fixed width word is
awkward.
Stack Machines
(Mostly) Died by 1980
1. Stack programs are not smaller if short (Register)
addresses are permitted.
2. Modern compilers can manage fast register space
better than the stack discipline.
3. Lexical addressing is a useful abstract model for
compilers but hardware support for it (i.e. display)
is not necessary.
GPRs and caches are better than stack and displays
Early language-directed architectures often did not
take into account the role of compilers!
B5000, B6700, HP 3000, ICL 2900, Symbolics 3600
Stacks post-1980
Inmos Transputers (1985-2000)
Designed to support many parallel processes in Occam language
Fixed-height stack design simplified implementation
Stack trashed on context swap (fast context switches)
Inmos T800 was the world's fastest microprocessor in the late '80s
Forth machines
Direct support for Forth execution in small embedded real-time
environments
Several manufacturers (Rockwell, Patriot Scientific)
Java Virtual Machine
Designed for software emulation not direct hardware execution
Sun PicoJava implementation + others
Intel x87 floating-point unit
Severely broken stack model for FP arithmetic
Deprecated in Pentium-4 in exchange for SSE2 FP register arch.
IBM 360: Precise Interrupts
IBM 360 ISA (Instruction Set Architecture)
preserves sequential execution model
Programmer's view of the machine was that each
instruction either completed or signaled a fault
before the next instruction began execution
Exception/interrupt behavior constant across
family of implementations
IBM 360:
Some Addressing Modes

RR format:
  opcode (8) | R1 (4) | R2 (4)
  R1 ← R1 op R2

RX format:
  opcode (8) | R1 (4) | X2 (4) | B2 (4) | D2 (12)
  R1 ← R1 op M[X2 + B2 + D2]
  a 24-bit address is formed by adding the
  12-bit displacement (D) to a base register (B)
  and an index register (X), if desired

These are the most common formats for arithmetic & logic
instructions, as well as Load and Store instructions.
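On the real 360, a base or index field of 0 contributes zero rather than
register 0's contents (one of register 0's "special properties" mentioned
earlier). With that rule, a sketch of the RX effective-address computation:

# RX effective address: EA = (X2) + (B2) + D2, truncated to 24 bits.
def rx_effective_address(regs, x2, b2, d2):
    index = regs[x2] if x2 != 0 else 0   # field 0 means "no index"
    base  = regs[b2] if b2 != 0 else 0   # field 0 means "no base"
    return (index + base + d2) & 0xFFFFFF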
IBM 360:
Branches & Condition Codes
Arithmetic and logic instructions set condition codes
equal to zero
greater than zero
overflow
carry...
I/O instructions also set condition codes - channel busy
All conditional branch instructions are based on testing
these condition code registers (CCs)
RX and RR formats:
BC_   branch conditionally
BAL_  branch and link, i.e., R15 ← PC + 1,
      for subroutine calls
⇒ CCs must be part of the PSW
IBM 360:
Character String Operations
opcode (8) | length (8) | B1 (4) | D1 (12) | B2 (4) | D2 (12)

SS format: store-to-store instructions
M[B1 + D1] ← M[B1 + D1] op M[B2 + D2]
iterated 'length' times

Most operations on decimal and character strings
use this format:
MVC  move characters
MP   multiply two packed decimal strings
CLC  compare two character strings
...
⇒ a lot of memory operations per instruction
⇒ complicates exception & interrupt handling
Microarchitecture
Implementation of an ISA

[Diagram: a controller drives the control points of a datapath and
observes its status lines.]

Structure: how components are connected (static)
Sequencing: how data moves between components (dynamic)
The DLX ISA
Processor State
32 32-bit GPRs, R0 always contains a 0
32 single precision FPRs, may also be viewed as
16 double precision FPRs
FP status register, used for FP compares & exceptions
PC, the program counter
some other special registers
Data types
8-bit byte, 2-byte half word
32-bit word for integers
32-bit word for single precision floating point
64-bit word for double precision floating point
Load/Store style instruction set
data addressing modes: immediate & indexed
branch addressing modes: PC-relative & register indirect
Byte-addressable memory, big-endian mode
All instructions are 32 bits
(See Chapter 2, H&P for full description)
A Bus-based Datapath for DLX

[Figure: a single 32-bit bus connects the major units. The IR (loaded
by ldIR) supplies the opcode, the register fields rf1/rf2/rf3, and the
immediate (via an extender: ExtSel, enImm). A register file of 32 GPRs
plus the PC (RegSel, RegWrt, enReg; register 32 is the PC, 31 the link)
feeds the A and B input registers (ldA, ldB) of the ALU (OpSel, enALU).
The memory address register MA (ldMA) and the memory data port (MemWrt,
enMem) connect the memory to the bus; status signals are busy and
zero?.]

Microinstruction: register-to-register transfer (17 control signals)
MA ← PC        means  RegSel = PC;  enReg = yes; ldMA = yes
B ← Reg[rf2]   means  RegSel = rf2; enReg = yes; ldB = yes
Memory Module

[Figure: a RAM with addr and din (write data) inputs, a dout output
onto the bus, a Write/Read enable (we), and a busy status output.]
We will assume that Memory operates asynchronously
and is slow as compared to Reg-to-Reg transfers
DLX ALU Instructions

Register-Register form:
opcode (6) | rf1 (5) | rf2 (5) | rf3 (5) | function (11)
Reg[rf3] ← function(Reg[rf1], Reg[rf2])

Register-Immediate form:
opcode (6) | rf1 (5) | rf2 (5) | immediate (16)
Reg[rf2] ← function(Reg[rf1], SignExt(immediate))
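For concreteness, here is an illustrative encoder/decoder for the
register-register format (field widths from the slide; the opcode and
function values themselves are left abstract):

# Pack and unpack the DLX register-register format:
# opcode(6) | rf1(5) | rf2(5) | rf3(5) | function(11) = 32 bits.
def encode_rr(opcode, rf1, rf2, rf3, function):
    return (opcode << 26) | (rf1 << 21) | (rf2 << 16) | (rf3 << 11) | function

def decode_rr(word):
    return ((word >> 26) & 0x3F,   # opcode   (6 bits)
            (word >> 21) & 0x1F,   # rf1      (5 bits)
            (word >> 16) & 0x1F,   # rf2      (5 bits)
            (word >> 11) & 0x1F,   # rf3      (5 bits)
            word & 0x7FF)          # function (11 bits)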
Instruction Execution
Execution of a DLX instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)
and the computation of the address of the
next instruction
Microcontrol Unit
Maurice Wilkes, 1954
Embed the control logic state table in a memory array
[Figure: the opcode and a conditional-code flip-flop address the array
through a decoder; Matrix A produces the control lines to the ALU,
MUXes and registers, and Matrix B produces the next-state address.]
Microprogram Fragments

instr fetch:  MA ← PC
              IR ← Memory
              A ← PC
              PC ← A + 4
              dispatch on OPcode      (can be treated as a macro)

ALU:          A ← Reg[rf1]
              B ← Reg[rf2]
              Reg[rf3] ← func(A,B)
              do instruction fetch

ALUi:         A ← Reg[rf1]
              B ← Imm                 (sign extension ...)
              Reg[rf2] ← Opcode(A,B)
              do instruction fetch
DLX Load/Store Instructions

opcode (6) | rf1 = base (5) | rf2 (5) | displacement (16)

Load/store byte, halfword, word to/from GPR:
LB, LBU, SB, LH, LHU, SH, LW, SW
byte and half-word can be sign- or zero-extended
Load/store single and double FP to/from FPR:
LF, LD, SF, SD
Byte-addressable machine
Memory access must be data aligned
A single addressing mode:
(base) + displacement
Big-endian byte ordering
DLX Control Instructions
Conditional branch on GPR:
opcode (6) | rf1 (5) | offset from PC+4 (16)
BEQZ, BNEZ

Unconditional register-indirect jumps:
opcode (6) | rf1 (5) | (16 bits unused)
JR, JALR

Unconditional PC-relative jumps:
opcode (6) | offset from PC+4 (26)
J, JAL

PC-offsets are specified in bytes
jump-&-link (JAL) stores PC+4 into the link register (R31)
(Real DLX has delayed branches - ignored in this lecture)
Microprogram Fragments (cont.)

LW:        A ← Reg[rf1]
           B ← Imm
           MA ← A + B
           Reg[rf2] ← Memory
           do instruction fetch

J:         A ← PC
           B ← Imm
           PC ← A + B
           do instruction fetch

beqz:      A ← Reg[rf1]
           if zero?(A) then go to bz-taken
           do instruction fetch

bz-taken:  A ← PC
           B ← Imm
           PC ← A + B
           do instruction fetch
DLX Microcontroller: first attempt

[Figure: inputs Opcode, zero?, Busy (memory) plus the current state
(μPC, s bits) form the address into a program ROM; the data out gives
the next state (s bits) and the 17 control signals. Latching the
inputs may cause a one-cycle delay.]

ROM size?
2^(opcode+status+s) words, where each word = (control + s) bits
How big is s?
Microprogramming
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Instruction Set Architecture
(ISA) versus Implementation
ISA is the hardware/software interface
Defines set of programmer visible state
Defines instruction format (bit encoding) and instruction
semantics
Examples: DLX, x86, IBM 360, JVM
Many possible implementations of one ISA
360 implementations: model 30 (c. 1964), z900 (c. 2001)
x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486,
Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon,
Transmeta Crusoe, SoftPC
DLX implementations: microcoded, pipelined, superscalar
JVM: HotSpot, PicoJava, ARM Jazelle, ...
ISA to Microarchitecture
Mapping
ISA often designed for particular microarchitectural
style, e.g.,
CISC ISAs designed for microcoded implementation
RISC ISAs designed for hardwired pipelined implementation
VLIW ISAs designed for fixed latency in-order pipelines
JVM ISA designed for software interpreter
But ISA can be implemented in any microarchitectural
style
Pentium-4: hardwired pipelined CISC (x86) machine (with some
microcode support)
This lecture: a microcoded RISC (DLX) machine
Intel will probably eventually have a dynamically scheduled, out-of-order VLIW (IA-64) processor
PicoJava: A hardware JVM processor
Microcoded Microarchitecture

[Figure: a μcontroller (microcode instructions fixed in ROM inside it)
takes busy?, zero?, and the opcode from the datapath and drives its
control signals (e.g., enMem, MemWrt). The datapath exchanges Data and
Addr with Memory; the memory (RAM) holds the user program, written
using macrocode instructions (e.g., DLX, x86, etc.).]
Microprogram in the ROM
worksheet

State    Op    zero?   busy   Control points           next-state
fetch0   *     *       *      MA ← PC                  fetch1
fetch1   *     *       yes    ....                     fetch1
fetch1   *     *       no     IR ← Memory              fetch2
fetch2   *     *       *      A ← PC                   fetch3
fetch3   *     *       *      PC ← A + 4               ?

ALU0     *     *       *      A ← Reg[rf1]             ALU1
ALU1     *     *       *      B ← Reg[rf2]             ALU2
ALU2     *     *       *      Reg[rf3] ← func(A,B)     fetch0
Microprogram in the ROM

State    Op     zero?   busy   Control points           next-state
fetch0   *      *       *      MA ← PC                  fetch1
fetch1   *      *       yes    ....                     fetch1
fetch1   *      *       no     IR ← Memory              fetch2
fetch2   *      *       *      A ← PC                   fetch3
fetch3   ALU    *       *      PC ← A + 4               ALU0
fetch3   ALUi   *       *      PC ← A + 4               ALUi0
fetch3   LW     *       *      PC ← A + 4               LW0
fetch3   SW     *       *      PC ← A + 4               SW0
fetch3   J      *       *      PC ← A + 4               J0
fetch3   JAL    *       *      PC ← A + 4               JAL0
fetch3   JR     *       *      PC ← A + 4               JR0
fetch3   JALR   *       *      PC ← A + 4               JALR0
fetch3   beqz   *       *      PC ← A + 4               beqz0
...
ALU0     *      *       *      A ← Reg[rf1]             ALU1
ALU1     *      *       *      B ← Reg[rf2]             ALU2
ALU2     *      *       *      Reg[rf3] ← func(A,B)     fetch0
Microprogram in the ROM (cont.)

State    Op     zero?   busy   Control points           next-state
ALUi0    *      *       *      A ← Reg[rf1]             ALUi1
ALUi1    sExt   *       *      B ← sExt16(Imm)          ALUi2
ALUi1    uExt   *       *      B ← uExt16(Imm)          ALUi2
ALUi2    *      *       *      Reg[rf3] ← Op(A,B)       fetch0
...
J0       *      *       *      A ← PC                   J1
J1       *      *       *      B ← sExt26(Imm)          J2
J2       *      *       *      PC ← A+B                 fetch0
...
beqz0    *      *       *      A ← Reg[rf1]             beqz1
beqz1    *      yes     *      A ← PC                   beqz2
beqz1    *      no      *      ....                     fetch0
beqz2    *      *       *      B ← sExt16(Imm)          beqz3
beqz3    *      *       *      PC ← A+B                 fetch0
...
Size of Control Store

[Figure: status & opcode (w bits) and the current μPC (s bits) form
the address into the control ROM; its data output is the control
signals (c bits) plus the next μPC (s bits).]

size = 2^(w+s) x (c + s)

DLX: w = 6+2, c = 17, s = ?
no. of steps per opcode = 4 to 6 + fetch-sequence
no. of states ≈ (4 steps per op-group) x op-groups + common sequences
             = 4 x 8 + 10 states = 42 states ⇒ s = 6
Control ROM = 2^(8+6) x 23 bits ≈ 48 Kbytes
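The arithmetic is easy to verify (a quick check, numbers from the slide):

# Control store size = 2^(w+s) words of (c+s) bits each.
w, s, c = 6 + 2, 6, 17
words = 2 ** (w + s)                     # 2^14 = 16384 addresses
bits_per_word = c + s                    # 17 control + 6 next-state = 23
print(words * bits_per_word / 8 / 1024)  # 46.0 KB, i.e. roughly 48 Kbytes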
Reducing Size of Control Store
Control store has to be fast ⇒ expensive
Reduce the ROM height (= address bits)
reduce inputs by extra external logic
each input bit doubles the size of the
control store
reduce states by grouping opcodes
find common sequences of actions
condense input status bits
combine all exceptions into one, i.e.,
exception/no-exception
Reduce the ROM width
restrict the next-state encoding
Next, Dispatch on opcode, Wait for memory, ...
encode control signals (vertical microcode)
DLX Controller V2

[Figure: the opcode is first encoded by external logic into an
op-group (and absolute/ext dispatch addresses), reducing ROM height;
the μPC (state) addresses the control ROM, whose data gives the 17
control signals plus a JumpType. Jump logic combines the JumpType with
the zero and busy status bits and the incremented μPC (PC+1) to select
the next μPC (PCSrc), reducing ROM width by encoding the next state.]

JumpType ∈ (next, spin, fetch, dispatch, feqz, fnez)
Jump Logic

PCSrc = Case JumpType
  next      ⇒ PC+1
  spin      ⇒ if (busy) then PC else PC+1
  fetch     ⇒ absolute
  dispatch  ⇒ op-group
  feqz      ⇒ if (zero) then absolute else PC+1
  fnez      ⇒ if (zero) then PC+1 else absolute
Instruction Fetch & ALU:
DLX-Controller-2

State    Control points           next-state
fetch0   MA ← PC                  next
fetch1   IR ← Memory              spin
fetch2   A ← PC                   next
fetch3   PC ← A + 4               dispatch
...
ALU0     A ← Reg[rf1]             next
ALU1     B ← Reg[rf2]             next
ALU2     Reg[rf3] ← func(A,B)     fetch

ALUi0    A ← Reg[rf1]             next
ALUi1    B ← sExt16(Imm)          next
ALUi2    Reg[rf3] ← Op(A,B)       fetch
Load & Store:
DLX-Controller-2

State    Control points           next-state
LW0      A ← Reg[rf1]             next
LW1      B ← sExt16(Imm)          next
LW2      MA ← A+B                 next
LW3      Reg[rf2] ← Memory        spin
LW4                               fetch

SW0      A ← Reg[rf1]             next
SW1      B ← sExt16(Imm)          next
SW2      MA ← A+B                 next
SW3      Memory ← Reg[rf2]        spin
SW4                               fetch
Branches:
DLX-Controller-2

State    Control points           next-state
BEQZ0    A ← Reg[rf1]             next
BEQZ1                             fnez
BEQZ2    A ← PC                   next
BEQZ3    B ← sExt16(Imm)          next
BEQZ4    PC ← A+B                 fetch

BNEZ0    A ← Reg[rf1]             next
BNEZ1                             feqz
BNEZ2    A ← PC                   next
BNEZ3    B ← sExt16(Imm)          next
BNEZ4    PC ← A+B                 fetch
Jumps:
DLX-Controller-2

State    Control points           next-state
J0       A ← PC                   next
J1       B ← sExt26(Imm)          next
J2       PC ← A+B                 fetch

JR0      PC ← Reg[rf1]            fetch

JAL0     A ← PC                   next
JAL1     Reg[31] ← A              next
JAL2     B ← sExt26(Imm)          next
JAL3     PC ← A+B                 fetch

JALR0    A ← PC                   next
JALR1    Reg[31] ← A              next
JALR2    PC ← Reg[rf1]            fetch
Microprogramming in IBM 360

                          M30     M40     M50     M65
Datapath width (bits)     8       16      32      64
μinst width (bits)        50      52      85      87
μcode size (K μinsts)     4       4       2.75    2.75
μstore technology         CCROS   TCROS   BCROS   BCROS
μstore cycle (ns)         750     625     500     200
memory cycle (ns)         1500    2500    2000    750
Rental fee ($K/month)     4       7       15      35
Only fastest models (75 and 95) were hardwired
Horizontal vs Vertical μCode

[Figure: horizontal μcode - fewer, wider μinstructions; vertical μcode -
more, narrower μinstructions (bits per μinstruction vs number of
μinstructions).]

Horizontal μcode has longer μinstructions
  Can specify multiple parallel operations per μinstruction
  Needs fewer steps to complete each macroinstruction
  Sparser encoding ⇒ more bits
Vertical μcode has more, narrower μinstructions
  In the limit, only a single datapath operation per μinstruction
  μcode branches require a separate μinstruction
  More steps to complete each macroinstruction
  More compact ⇒ fewer bits
Nanocoding
  Tries to combine the best of horizontal and vertical μcode
Nanocoding

Exploits recurring control-signal patterns in μcode, e.g.,
ALU0   A ← Reg[rf1]
...
ALUi0  A ← Reg[rf1]
...

[Figure: the μPC (state) addresses the μcode ROM, which holds the
next-state μaddress and a nanoaddress; the nanoaddress selects a
nanoinstruction ROM entry whose data drives the 17 control signals.]

MC68000 had 17-bit μcode containing either a 10-bit μjump or a 9-bit
nanoinstruction pointer
Nanoinstructions were 68 bits wide, decoded to give
196 control signals
Implementing Complex Instructions

[Figure: the same bus-based DLX datapath as before - IR, register file
(32 GPRs + PC), A/B registers, ALU, immediate extender, MA register,
and memory, all on one 32-bit bus.]

Reg-Memory-src ALU op    rf3 ← M[(rf1)] op (rf2)
Reg-Memory-dst ALU op    M[(rf3)] ← (rf1) op (rf2)
Mem-Mem ALU op           M[(rf3)] ← M[(rf1)] op M[(rf2)]
Mem-Mem ALU Instructions:
DLX-Controller-2

Mem-Mem ALU op:  M[(rf3)] ← M[(rf1)] op M[(rf2)]

State     Control points           next-state
ALUMM0    MA ← Reg[rf1]            next
ALUMM1    A ← Memory               spin
ALUMM2    MA ← Reg[rf2]            next
ALUMM3    B ← Memory               spin
ALUMM4    MA ← Reg[rf3]            next
ALUMM5    Memory ← func(A,B)       spin
ALUMM6                             fetch
Complex instructions usually do not require datapath
modifications in a microprogrammed implementation
-- only extra space for the control program
Implementing these instructions using a hardwired controller is
difficult without datapath modifications
Microcode Emulation
IBM initially miscalculated importance of
software compatibility when introducing 360
series
Honeywell started effort to steal IBM 1401
customers by offering translation software
(Liberator) for Honeywell H200 series machine
IBM retaliates with optional additional microcode
for 360 series that can emulate IBM 1401 ISA, later
extended for IBM 7000 series
one popular program on the 1401 was a 650 simulator, so some
customers ran many 650 programs on emulated 1401s (650 → 1401 → 360)
Microprogramming in the
Seventies
Thrived because:
Significantly faster ROMs than DRAMs were available
For complex instruction sets, datapath and controller
were cheaper and simpler
New instructions, e.g., floating point, could be
supported without datapath modifications
Fixing bugs in the controller was easier
ISA compatibility across various models could be
achieved easily and cheaply
Except for cheapest and fastest machines, all computers
were microprogrammed
Writable Control Store (WCS)
Implement control store with SRAM not ROM
MOS SRAM memories now almost as fast as control store (core
memories/DRAMs were 10x slower)
Bug-free microprograms difficult to write
User-WCS provided as option on several minicomputers
Allowed users to change microcode for each process
User-WCS failed
Little or no programming tools support
Hard to fit software into small space
Microcode control tailored to original ISA, less useful for others
Large WCS part of processor state - expensive context switches
Protection difficult if user can change microcode
Virtual memory required restartable microcode
Performance Issues
Microprogrammed control
⇒ multiple cycles per instruction

Cycle time?
t_C > max(t_reg-reg, t_ALU, t_μROM, t_RAM)

Given complex control, t_ALU and t_RAM can be broken
into multiple cycles. However, t_μROM cannot be
broken down. Hence
t_C > max(t_reg-reg, t_μROM)

Suppose 10 * t_μROM < t_RAM
⇒ good performance, relative to the single-cycle
hardwired implementation, can be achieved
even with a CPI of 10
VLSI & Microprogramming
By the late seventies:
the technology assumption about ROM & RAM
speed became invalid
μmachines became more complicated
to overcome slower ROM, μmachines
were pipelined
complex instruction sets led to the need for
subroutine and call stacks in μcode
the need for fixing bugs in control programs
was in conflict with the read-only nature of μROM
⇒ WCS (B1700, QMachine, Intel 432, ...)
introduction of caches and buffers, especially for
instructions, made multiple-cycle execution of
reg-reg instructions unattractive
Modern Usage
Microprogramming is far from extinct
Played a crucial role in micros of the Eighties:
Motorola 68K series
Intel 386 and 486
Microcode is present in most modern CISC micros in an
assisting role (e.g. AMD Athlon, Intel Pentium-4)
Most instructions are executed directly, i.e., with
hard-wired control
Infrequently-used and/or complicated instructions
invoke the microcode engine
Patchable microcode is common for post-fabrication bug
fixes, e.g. Intel Pentiums load μcode patches at boot-up
Simple Instruction
Pipelining
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Processor Performance Equation

  Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
Instructions per program depends on source code,
compiler technology, and ISA
The microcoded DLX from last lecture had a minimum
cycles per instruction (CPI) of around 7
Time per cycle for microcoded DLX fixed by
microcode cycle time
mostly ROM access + next PC select logic
Pipelined DLX
To pipeline DLX:
First build unpipelined DLX with CPI=1
Next, add pipeline registers to reduce
cycle time while maintaining CPI=1
A Simple Memory Model
[Figure: a MAGIC RAM block with inputs WriteEnable, Clock, Address,
WriteData and output ReadData]
Reads and writes are always completed in one cycle
a Read can be done any time (i.e. combinational)
a Write is performed at the rising clock edge
if it is enabled
the write address, data, and enable
must be stable at the clock edge
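A minimal C sketch of this memory model (the type names and size are my own assumptions): reads are combinational, writes take effect at the rising clock edge.

  #include <stdint.h>

  #define MEM_WORDS 1024                 /* assumed size */

  typedef struct { uint32_t mem[MEM_WORDS]; } MagicRam;

  /* Combinational read: valid at any time within the cycle. */
  uint32_t ram_read(const MagicRam *r, uint32_t addr) {
      return r->mem[addr % MEM_WORDS];
  }

  /* Called once per rising clock edge: the write happens only if
     WriteEnable is asserted, and completes within the cycle. */
  void ram_clock_edge(MagicRam *r, int write_enable,
                      uint32_t addr, uint32_t write_data) {
      if (write_enable)
          r->mem[addr % MEM_WORDS] = write_data;
  }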
Datapath for ALU Instructions
[Datapath diagram: PC with 0x4 adder, instruction memory, GPR file
(read ports rs1/rs2 → rd1/rd2, write port ws/wd), immediate extender,
and ALU; control signals: RegWrite, RegDst (rf2 / rf3), ExtSel,
OpSel (Func / Op), BSrc (Reg / Imm); instruction fields inst<25:21> = rf1,
inst<20:16> = rf2, inst<15:11> = rf3, inst<15:0> = immediate,
inst<31:26> = opcode, inst<5:0> = func]

ALU instruction semantics:
  rf3 ← (rf1) func (rf2)
  rf2 ← (rf1) op immediate
Datapath for Memory
Instructions
Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence)
  - read-only program memory
  - read/write data memory
  - but at some level the two memories have to be the same

Princeton style: the same (von Neumann's influence)
  - a Load or Store instruction requires accessing the
    memory more than once during its execution
Load/Store Instructions:
Harvard-Style Datapath
[Datapath diagram: Harvard-style, adding a data memory
(we / addr / rdata / wdata) after the ALU, with control signals
MemWrite and WBSrc (ALU / Mem)]

Load/Store addressing mode: effective address = (rf1) + displacement
  rf1 is the base register
  rf2 is the destination of a Load or the source for a Store
Memory Hierarchy c.2002
Desktop & small server, c.2002:

  Level                       Latency       Clocks        Capacity
  On-chip L1 caches (I$, D$)  0.5-2 ns      2-3 clk       8~64 KB
  On-chip L2 cache            <10 ns        5~15 clk      0.25-2 MB
  Off-chip L3 (SRAM/eDRAM)    <25 ns        15~50 clk     1~8 MB
  Interleaved DRAM banks      ~150 ns       100~300 clk   64 MB~1 GB
  Hard disk                   ~10 ms seek   ~10^7 clk     20~100 GB
Our memory model is a good approximation of
the hierarchical memory system when we hit in
the on-chip cache
Conditional Branches
[Datapath diagram: a second adder computes the branch target from the PC
and the sign-extended offset; PCSrc (~j / j) selects between PC+4 and the
branch target; a zero? test on the register-file output feeds the control
logic; other control signals as before, plus WBSrc]
Register-Indirect Jumps
[Datapath diagram: PCSrc extended to (~j / j RInd / j PCR); the
register-file output is routed to the PC for register-indirect jumps, and
a Jump & Link? control signal is added]
Jump & Link
[Datapath diagram: the link register R31 is added as a write destination
(RegDst = rf3 / rf2 / R31) and the PC is routed to the register write-back
mux (WBSrc = ALU / Mem / PC)]
PC-Relative Jumps
[Datapath diagram: no new datapath required; the immediate extender gains
a 26-bit mode (ExtSel = Ext16 / Ext26) and the extended offset feeds the
PC adder]
Hardwired Control is pure Combinational Logic: Unpipelined DLX

[Figure: combinational logic maps (op code, zero?) to the control signals
ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]
ALU Control & Immediate Extension

[Figure: a decode map takes Inst<31:26> (Opcode) and Inst<5:0> (Func) and
produces the ALU operation via OpSel ∈ {Func, Op, +, 0?} and the
immediate-extension mode ExtSel ∈ {sExt16, uExt16, sExt26, High16}]
Hardwired Control (worksheet)

[Datapath diagram: the complete unpipelined DLX with all control points:
PCSrc (PCR / RInd / ~j), RegWrite, MemWrite, BSrc (Reg / Imm),
RegDst (rf2 / rf3 / R31), ExtSel (sExt16 / uExt16 / sExt26 / High16),
OpSel (Func / Op / + / 0?), WBSrc (ALU / Mem / PC), plus the zero? signal]
Hardwired Control Table

BSrc = Reg / Imm     WBSrc = ALU / Mem / PC     RegDst = rf2 / rf3 / R31
PCSrc1 = j / ~j      PCSrc2 = PCR / RInd

  Opcode      ExtSel   BSrc   OpSel   MemWrite   RegWrite   WBSrc   RegDest   PCSrc
  ALU         *        Reg    Func    no         yes        ALU     rf3       ~j
  ALUu        *        Reg    Func    no         yes        ALU     rf3       ~j
  ALUi        sExt16   Imm    Op      no         yes        ALU     rf2       ~j
  ALUiu       uExt16   Imm    Op      no         yes        ALU     rf2       ~j
  LW          sExt16   Imm    +       no         yes        Mem     rf2       ~j
  SW          sExt16   Imm    +       yes        no         *       *         ~j
  BEQZ(z=0)   sExt16   *      0?      no         no         *       *         ~j
  BEQZ(z=1)   sExt16   *      0?      no         no         *       *         PCR
  J           sExt26   *      *       no         no         *       *         PCR
  JAL         sExt26   *      *       no         yes        PC      R31       PCR
  JR          *        *      *       no         no         *       *         RInd
  JALR        *        *      *       no         yes        PC      R31       RInd

  (* = don't care)
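A compact C sketch of a few rows of this decode table (the enum and opcode encodings are my own assumptions, not DLX's); the point is that hardwired control is a pure function of the opcode (PCSrc, which also depends on zero?, is omitted here):

  typedef enum { EXT_SEXT16, EXT_UEXT16, EXT_SEXT26, EXT_X } ExtSel;
  typedef enum { OP_FUNC, OP_OP, OP_PLUS, OP_ZERO, OP_X } OpSel;
  typedef enum { WB_ALU, WB_MEM, WB_PC, WB_X } WBSrc;
  enum { OPC_ALU, OPC_ALUI, OPC_LW, OPC_SW, OPC_BEQZ };  /* assumed */

  typedef struct {
      ExtSel ext; int bsrc_imm; OpSel op;
      int mem_write, reg_write; WBSrc wb;
  } Ctrl;

  Ctrl decode(int opcode) {
      switch (opcode) {               /* one row of the table per case */
      case OPC_ALU:  return (Ctrl){EXT_X,      0, OP_FUNC, 0, 1, WB_ALU};
      case OPC_ALUI: return (Ctrl){EXT_SEXT16, 1, OP_OP,   0, 1, WB_ALU};
      case OPC_LW:   return (Ctrl){EXT_SEXT16, 1, OP_PLUS, 0, 1, WB_MEM};
      case OPC_SW:   return (Ctrl){EXT_SEXT16, 1, OP_PLUS, 1, 0, WB_X};
      case OPC_BEQZ: return (Ctrl){EXT_SEXT16, 0, OP_ZERO, 0, 0, WB_X};
      default:       return (Ctrl){EXT_X,      0, OP_X,    0, 0, WB_X};
      }
  }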
Hardwired Unpipelined Machine
Simple
One instruction per cycle
Why wasn't this a popular machine style?
Unpipelined DLX
Clock period must be sufficiently long for all of the
following steps to be completed:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
tC > tIFetch + tRFetch + tALU + tDMem + tRWB
At the rising edge of the following clock, the
PC, the register file and the memory are updated
Pipelined DLX Datapath
[Datapath diagram: the DLX datapath divided into five phases by pipeline
registers: fetch (PC, instruction memory, IR), decode & register-fetch
(GPRs, immediate extender), execute (ALU), memory (data memory), and
write-back]
Clock period can be reduced by dividing the execution
of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} = tDM (probably)
However, CPI will increase unless instructions
are pipelined
An Ideal Pipeline

[Figure: objects flow through stage 1 → stage 2 → stage 3 → stage 4]
All objects go through the same stages
No sharing of resources between any two stages
Propagation delay through all pipeline stages is equal
The scheduling of an object entering the pipeline is
not affected by the objects in other stages
These conditions generally hold for industrial
assembly lines. An instruction pipeline, however,
cannot satisfy the last condition. Why?
Pipelining History
Some very early machines had limited pipelined
execution (e.g., Zuse Z4, WWII)
Usually overlap fetch of next instruction with current execution
IBM Stretch first major supercomputer
incorporating extensive pipelining, result
bypassing, and branch prediction
project started in 1954, delivered in 1961
didn't meet initial performance goal of being 100x faster with 10x
faster circuits
up to 11 macroinstructions in pipeline at same time
microcode engine highly pipelined also (up to 6
microinstructions in pipeline at same time)
Stretch was origin of 8-bit byte and lower case characters,
carried on into IBM 360
How to divide the datapath
into stages
Suppose memory is significantly slower than other
stages. In particular, suppose
tIM = tDM = 10 units
tALU = 5 units
tRF = tRW = 1 unit
Since the slowest stage determines the clock, it may
be possible to combine some stages without any loss
of performance
Minimizing Critical Path
[Datapath diagram: the same five-phase pipeline, but decode & register-fetch
and execute are combined into a single stage]

tC > max {tIM, tRF + tALU, tDM, tRW}
tC > max {tIM, tRF + tALU, tDM, tRW}
The write-back stage takes much less time than the other stages.
Suppose we combined it with the memory phase ⇒ we would increase the
critical path by 10% (the memory stage grows from 10 to 11 units).
Maximum Speedup by
Pipelining
For the 4-stage pipeline, given
tIM = tDM = 10 units, tALU = 5 units, tRF = tRW= 1 unit
tC could be reduced from 27 units to 10 units
speedup = 2.7
However, if tIM = tDM = tALU = tRF = tRW = 5 units
The same 4-stage pipeline can reduce tC from 25 units to
10 units
speedup = 2.5
But, since tIM = tDM = tALU = tRF = tRW, it is possible to
achieve higher speedup with more stages in the pipeline.
A 5-stage pipeline can reduce tC from 25 units to
5 units
speedup = 5
Technology Assumptions
We will assume
A small amount of very fast memory (caches)
backed up by a large, slower memory
Fast ALU (at least for integers)
Multiported Register files (slower!).
These make the following timing assumption valid:
  tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW
A 5-stage pipelined Harvard-style architecture will
be the focus of our detailed design
5-Stage Pipelined Execution

[Datapath diagram: fetch phase (IF), decode & register-fetch phase (ID),
execute phase (EX), memory phase (MA), and write-back phase (WB), separated
by pipeline registers]

  time          t0   t1   t2   t3   t4   t5   t6   t7  ...
  instruction1  IF1  ID1  EX1  MA1  WB1
  instruction2       IF2  ID2  EX2  MA2  WB2
  instruction3            IF3  ID3  EX3  MA3  WB3
  instruction4                 IF4  ID4  EX4  MA4  WB4
  instruction5                      IF5  ID5  EX5  MA5  WB5
5-Stage Pipelined Execution: Resource Usage Diagram

[Same five-stage datapath diagram as above]

  Resources  t0   t1   t2   t3   t4   t5   t6   t7  ...
  IF         I1   I2   I3   I4   I5
  ID              I1   I2   I3   I4   I5
  EX                   I1   I2   I3   I4   I5
  MA                        I1   I2   I3   I4   I5
  WB                             I1   I2   I3   I4   I5
Pipelined Execution: ALU Instructions (not quite correct!)

[Datapath diagram: pipeline registers A (after register fetch) and MD1, MD2
(in the memory and write-back stages) are added, but there is still only a
single IR]
Pipelined Execution: Need for Several IRs

[Datapath diagram: as above, but with one IR per pipeline stage so that
each instruction carries its own opcode and register specifiers down the
pipeline]
IRs and Control points

[Datapath diagram: each stage's control points are driven by that stage's
own IR]
Are control points connected properly?
- Load/Store instructions
- ALU instructions
Pipelined DLX Datapath (without jumps)

[Datapath diagram: the full pipelined datapath with per-stage IRs and the
control signals ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite]
How Instructions can Interact
with each other in a Pipeline
An instruction in the pipeline may need a resource being used by
another instruction in the pipeline
  ⇒ structural hazard
An instruction may produce data that is needed by a later instruction
  ⇒ data hazard
In the extreme case, an instruction may determine the next instruction
to be executed
  ⇒ control hazard (branches, interrupts, ...)
Feedback to Resolve Hazards

[Figure: a four-stage pipeline with feedback paths FB1..FB4 from later
stages back to earlier stages]
Controlling pipeline in this manner works provided
the instruction at stage i+1 can complete without
any interference from instructions in stages 1 to i
(otherwise deadlocks may occur)
Feedback to previous stages is used to stall or kill
instructions
Pipeline Hazards
Krste Asanovic
Laboratory for Computer Science
M.I.T.
Pipelined DLX Datapath (without interlocks and jumps)

[Datapath diagram: the pipelined DLX datapath from the previous lecture,
with per-stage IRs and control signals but no interlock or jump logic]
Data Hazards

  ...
  r1 ← r0 + 10
  r4 ← r1 + 17
  ...

[Datapath diagram: the second instruction reads r1 in the decode stage
while the first, which writes r1, is still in the pipeline]

r4 ← r1 ... but r1 has not yet been written back. Oops!
Resolving Data Hazards
1. Interlocks
Freeze earlier pipeline stages until data becomes
available
2. Bypasses
If data is available somewhere in the datapath,
provide a bypass to get it to the right stage
Interlocks to resolve Data Hazards

Stall Condition:
  ...
  r1 ← r0 + 10
  r4 ← r1 + 17
  ...

[Datapath diagram: a nop can be muxed into the decode-stage IR while the
PC and the fetch-stage IR are frozen, so the dependent instruction waits
in decode]
Stalled Stages and Pipeline Bubbles

  time                 t0   t1   t2   t3   t4   t5   t6   t7  ...
  (I1) r1 ← r0 + 10    IF1  ID1  EX1  MA1  WB1
  (I2) r4 ← r1 + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
  (I3)                           IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
  (I4)                                               IF4  ID4  EX4  MA4  WB4
  (I5)                                                    IF5  ID5  EX5  MA5  WB5
                                    (stalled stages: t3-t5)

  Resource Usage:
             t0   t1   t2   t3   t4   t5   t6   t7  ...
  IF         I1   I2   I3   I3   I3   I3   I4   I5
  ID              I1   I2   I2   I2   I2   I3   I4
  EX                   I1   nop  nop  nop  I2   I3
  MA                        I1   nop  nop  nop  I2
  WB                             I1   nop  nop  nop
                          (the nops form the pipeline bubble)
Interlock Control Logic (worksheet)

[Datapath diagram: a comparator block Cstall compares the decode-stage
source specifiers rf1, rf2 against the destination specifier ws produced
by Cdest in each downstream stage, generating the stall signal]
Compare the source registers of the instruction in
the decode stage with the destination register of the
uncommitted instructions.
Interlock Control Logic (ignoring jumps & branches)

[Datapath diagram: Cdest logic in each of the E, M, W stages produces
(ws, we); Cre produces (re1, re2) for the decode-stage instruction; Cstall
combines (wsE, weE), (wsM, weM), (wsW, weW) with (rf1, rf2, re1, re2) into
the stall signal]

Should we always stall if the rs field matches some rd?
  not every instruction writes registers ⇒ we
  not every instruction reads registers  ⇒ re
Source & Destination Registers

  R-type:  op | rf1 | rf2 | rf3 | func
  I-type:  op | rf1 | rf2 | immediate16
  J-type:  op | immediate26

  Opcode   Semantics                                 Source(s)   Destination
  ALU      rf3 ← (rf1) func (rf2)                    rf1, rf2    rf3
  ALUi     rf2 ← (rf1) op imm                        rf1         rf2
  LW       rf2 ← M[(rf1) + imm]                      rf1         rf2
  SW       M[(rf1) + imm] ← (rf2)                    rf1, rf2    -
  B__Z     cond (rf1): true:  PC ← PC + 4 + imm      rf1         -
                       false: PC ← PC + 4
  J        PC ← PC + 4 + imm                         -           -
  JAL      r31 ← PC, PC ← PC + 4 + imm               -           R31
  JR       PC ← (rf1)                                rf1         -
  JALR     r31 ← PC, PC ← (rf1)                      rf1         R31
Deriving the Stall Signal

Cdest:
  ws = Case opcode
         ALU        ⇒ rf3
         ALUi, LW   ⇒ rf2
         JAL, JALR  ⇒ R31
  we = Case opcode
         ALU, ALUi, LW, JAL, JALR ⇒ on
         ...                      ⇒ off

Cre:
  re1 = Case opcode
          ALU, ALUi, ... ⇒ on
          ...            ⇒ off
  re2 = Case opcode
          ...            ⇒ on / off

stall = ?
Stall if a source register of the instruction in the decode stage
matches the destination register of an uncommitted instruction.
The Stall Signal

Cdest:
  ws = Case opcode
         ALU        ⇒ rf3
         ALUi, LW   ⇒ rf2
         JAL, JALR  ⇒ R31
  we = Case opcode
         ALU, ALUi, LW, JAL, JALR ⇒ (ws ≠ R0)
         ...                      ⇒ off

Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, B__Z, JR, JALR ⇒ on
          J, JAL                            ⇒ off
  re2 = Case opcode
          ALU, SW ⇒ on
          ...     ⇒ off

Cstall:
  stall = ((rf1D = wsE)·weE + (rf1D = wsM)·weM + (rf1D = wsW)·weW) · re1D
        + ((rf2D = wsE)·weE + (rf2D = wsM)·weM + (rf2D = wsW)·weW) · re2D

This is not the full story!
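A small C sketch of this stall computation (the struct and field names are my own assumptions, not from the lecture); it checks the decode-stage source registers against the destination registers of the three uncommitted downstream stages:

  #include <stdbool.h>

  typedef struct {
      int  ws;    /* destination register specifier (wsE/wsM/wsW)   */
      bool we;    /* stage's instruction actually writes a register */
  } StageDest;

  typedef struct {
      int  rf1, rf2;   /* decode-stage source specifiers (rf1D, rf2D) */
      bool re1, re2;   /* whether each source is actually read        */
  } DecodeSrc;

  /* stall = (rf1D matches an uncommitted dest)·re1D
           + (rf2D matches an uncommitted dest)·re2D                  */
  bool stall(DecodeSrc d, StageDest e, StageDest m, StageDest w) {
      bool hit1 = (d.rf1 == e.ws && e.we) || (d.rf1 == m.ws && m.we) ||
                  (d.rf1 == w.ws && w.we);
      bool hit2 = (d.rf2 == e.ws && e.we) || (d.rf2 == m.ws && m.we) ||
                  (d.rf2 == w.ws && w.we);
      return (hit1 && d.re1) || (hit2 && d.re2);
  }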
Hazards due to Loads & Stores

Stall Condition:
  ...
  M[r1+7] ← r2
  r4 ← M[r3+5]
  ...

[Datapath diagram: as before]

Is there any possible data hazard in this instruction sequence?
Hazards due to Loads & Stores (depends on the memory system)

  M[r1+7] ← r2
  r4 ← M[r3+5]
  ...

If r1+7 = r3+5 ⇒ data hazard through memory. However, the hazard is
avoided because our memory system completes writes in one cycle!
VAX: A Complex Instruction Set

  Addressing Mode                       Syntax           Length in bytes
  Literal                               #value           1 (6-bit signed value)
  Immediate                             #value           1 + |immediate|
  Register                              Rn               1
  Register deferred                     (Rn)             1
  Byte/word/long displacement           disp(Rn)         1 + |displacement|
  Byte/word/long displacement deferred  @disp(Rn)        1 + |displacement|
  Scaled (Indexed)                      base mode (Rx)   1 + |base addressing mode|
  Autoincrement                         (Rn)+            1
  Autodecrement                         -(Rn)            1
  Autoincrement deferred                @(Rn)+           1

Example:  ADDL3 R1, 737(R2), #456        R1 ← M[(R2)+737] + 456
          1 + 1 + (2 + 1) + (1 + 4) = 10 bytes!
VAX Extremes
Long instructions, e.g.,
emodh #1.2334, 70000(r1)[r2],
#2.3456, 80000(r1)[r2],
90000(r1)[r2]
54 byte encoding (rumors of even longer instruction
sequences...)
Memory referencing behavior
In worst case, a single instruction could require at least 41
different memory pages to be resident in memory to complete
execution
Doesn't include string instructions, which were designed to be
interruptible
Reduced Instruction Set
Computers
(Cocke, IBM; Patterson, UC Berkeley; Hennessy, Stanford)
Compilers have difficulty using complex instructions
  VAX: 60% of the microcode supports 20% of the instructions, which
  account for only 0.2% of execution time
  IBM experiment retargets a 370 compiler to use only a simple subset
  of the ISA ⇒ the compiler generated faster code!
Simple instruction sets don't need microcode
  Use fast memory near the processor as a cache, not as microcode storage
Design ISA for simple pipelined implementation
Fixed length, fixed format instructions
Load/store architecture with up to one memory access/instruction
Few addressing modes, synthesize others with code sequence
Register-register ALU operations
Delayed branch
MIPS R2000
(One of first commercial RISCs, 1986)
Load/Store architecture
32x32-bit GPRs (R0 hardwired to zero), HI & LO SPRs (for multiply/divide)
74 instructions
Fixed instruction size (32 bits), only 3 formats
PC-relative branches, register indirect jumps
Only base+displacement addressing mode
No condition bits, compares write GPRs, branches test GPRs
Delayed loads and branches
Five-stage instruction pipeline
Fetch, Decode, Execute, Memory, Write Back
CPI of 1 for register-to-register ALU instructions
8 MHz clock
Tightly-coupled off-chip FP accelerator (R2010)
RISC/CISC Comparisons (late '80s / early '90s)

  Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
R2000 vs VAX 8700 [Bhandarkar and Clark, 91]
R2000 has ~2.7x advantage with equivalent technology
Intel 80486 vs Intel i860 (both 1989)
Same company, same CAD tools, same process
i860 2-4x faster - even more on some floating-point tasks
DEC nVAX vs Alpha 21064 (both 1992)
Same company, same CAD tools, same process
Alpha 2-4x faster
Complications due to Jumps

[Datapath diagram: PCSrc1 (j / ~j) and PCSrc2 (PCR / RInd) select the next
PC; an extra adder (and a stall) is needed for register-indirect jumps; a
Jump? signal from decode can mux a nop into the pipeline]

  096  ADD       (I1)
  100  J +200    (I2)
  104  ADD       (I3)  (fetched before the jump resolves)
  304  SUB       (I4)

A jump instruction kills (not stalls) the following instruction. How?
Pipelining Jumps

[Datapath diagram: as above, with a mux inserted before the decode-stage
IR so a nop can replace the fetched instruction]

  096  ADD       (I1)
  100  J +200    (I2)
  104  ADD       (I3, killed)
  304  SUB       (I4)

Killing the fetched instruction: insert a mux before IR.

  IRSrcD = Case opcodeD
             J, JAL ⇒ nop
             ...    ⇒ IM

Any interaction between stall and jump?
Jump Pipeline Diagrams

  time                  t0   t1   t2   t3   t4   t5   t6   t7  ...
  (I1) 096 ADD          IF1  ID1  EX1  MA1  WB1
  (I2) 100 J +200            IF2  ID2  EX2  MA2  WB2
  (I3) 104 ADD (kill)             IF3  nop  nop  nop  nop
  (I4) 304 SUB                         IF4  ID4  EX4  MA4  WB4

  Resource Usage:
             t0   t1   t2   t3   t4   t5   t6   t7  ...
  IF         I1   I2   I3   I4
  ID              I1   I2   nop  I4
  EX                   I1   I2   nop  I4
  MA                        I1   I2   nop  I4
  WB                             I1   I2   nop  I4
                          (the nop is the pipeline bubble)
Pipelining Conditional Branches

[Datapath diagram: as for jumps, but the zero? test on the register value
is performed in the execute stage]

  096  ADD            (I1)
  100  BEQZ r1, +200  (I2)
  104  ADD            (I3)
  304  ADD            (I4)

The branch condition is not known until the execute stage:
what action should be taken in the decode stage?
Conditional Branches: solution 1
PCSrc1 = (jD / jE / ~j), PCSrc2 = (PCR / RInd)

[Datapath diagram: the branch-target adder and the zero? test sit in the
execute stage; the PC and both the decode- and execute-stage IRs can be
overwritten when the branch resolves]

If the branch is taken:
  - kill the two following instructions
  - the instruction at the decode stage is not valid
    ⇒ the stall signal is not valid
New Stall Signal

stall = ( ((rf1D = wsE)·weE + (rf1D = wsM)·weM + (rf1D = wsW)·weW) · re1D
        + ((rf2D = wsE)·weE + (rf2D = wsM)·weM + (rf2D = wsW)·weW) · re2D )
        · !((opcodeE = BEQZ)·z + (opcodeE = BNEZ)·!z)
Control Equations for PC Muxes (Solution 1)

PCSrc1 = Case opcodeE
           BEQZ·z, BNEZ·!z ⇒ jE
           ...             ⇒ Case opcodeD
                               J, JAL, JR, JALR ⇒ jD
                               ...              ⇒ ~j

PCSrc2 = Case opcodeD
           J, JAL   ⇒ PCR
           JR, JALR ⇒ RInd

IRSrcD = Case opcodeE
           BEQZ·z, BNEZ·!z ⇒ nop
           ...             ⇒ (Case opcodeD
                               J, JAL, JR, JALR ⇒ nop
                               ...              ⇒ IM)

IRSrcE = Case opcodeE
           BEQZ·z, BNEZ·!z ⇒ nop
           ...             ⇒ stall·nop + !stall·IRD

Give priority to the older instruction, i.e., the execute-stage
instruction over the decode-stage instruction.
Conditional Branches: solution 2
Test for zero at the decode stage

[Datapath diagram: the zero? test is moved to the decode stage, testing
the register-file output directly, and the branch-target adder operates on
the decode-stage PC]

Need to kill only one instruction!
Wouldn't work if DLX had general branch conditions (i.e., r1 > r2)?
Conditional Branches: solution 3, Delayed Branches

Change the semantics of branches and jumps: the instruction after the
branch is always executed, regardless of whether the branch is taken or
not taken.

  ADD
  BEQZ r1, +200
  ADD            (delay slot)
  ADD

Need not kill any instructions!
Annulling Branches

Plain delayed branches only allow the compiler to fill the delay slot with
instructions that will always be executed:

  ADD R1, R2, R3                    BEQZ R3, target
  BEQZ R3, target           ⇒       ADD R1, R2, R3 (delay slot)
  NOP (delay slot)

Annulling branches kill the delay slot if the branch is not taken. The
compiler can therefore fill the delay slot with a copy of the branch target:

    ADD R1, R2, R3                    ADD R1, R2, R3
    BEQZ R1, target           ⇒       BEQZL R1, target+4 (annulling)
    NOP (delay slot)                  ADDI R2, R2, #1 (delay slot)
    ...                               ...
  target:                           target:
    ADDI R2, R2, #1                   ADDI R2, R2, #1
    SUB R5, R6, R8                    SUB R5, R6, R8

Can also have a variant of annulling branch that kills the delay slot if
the branch is taken.
Branch Delay Slots
First introduced in pipelined microcode engines.
Adopted for early RISC machines with single-issue pipelines, made
visible to user-level software.
Advantages
simplifies control logic for simple pipeline
compiler helps reduce branch hazard penalties
~70% of single delay slots usefully filled
Disadvantages
complicates ISA specification and programming
adds extra next-PC state to programming model
complicates control logic for more aggressive implementations
e.g., out-of-order superscalar designs
Out of favor for new (post-1990) general-purpose ISAs
Later lectures will cover advanced techniques for control hazards:
dynamic (run-time) schemes, e.g., branch prediction
static (compile-time) schemes, e.g., predicated execution
Pipelining Delayed Jumps & Links

[Datapath diagram: an RA pipeline register holds the return address for
linking; the link value is written to R31]

  096  ADD
  100  JALR +200
  104  ADD       (delay slot)
  304  ADD
Bypassing

With interlocks alone, each stall or kill introduces a bubble in the
pipeline ⇒ CPI > 1:

  time                 t0   t1   t2   t3   t4   t5   t6   t7  ...
  (I1) r1 ← r0 + 10    IF1  ID1  EX1  MA1  WB1
  (I2) r4 ← r1 + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
  (I3)                           IF3  IF3  IF3  IF3  ID3  EX3  MA3
  (I4)                                               IF4  ID4  EX4
  (I5)                                                    IF5  ID5
                                    (stalled stages: t3-t5)

A new datapath, i.e., a bypass, can get the data from the output of the
ALU to its input:

  time                 t0   t1   t2   t3   t4   t5   t6   t7  ...
  (I1) r1 ← r0 + 10    IF1  ID1  EX1  MA1  WB1
  (I2) r4 ← r1 + 17         IF2  ID2  EX2  MA2  WB2
  (I3)                           IF3  ID3  EX3  MA3  WB3
  (I4)                                IF4  ID4  EX4  MA4  WB4
  (I5)                                     IF5  ID5  EX5  MA5  WB5
Adding Bypasses

[Datapath diagram: the ALU output (the execute/memory pipeline register)
is fed back to the ALU's A input through a mux controlled by ASrc]

  ...
  (I1) r1 ← r0 + 10
  (I2) r4 ← r1 + 17
  ...

Of course you can add many more bypasses!
The Bypass Signal (deriving it from the stall signal)

Cdest:
  ws = Case opcode
         ALU        ⇒ rf3
         ALUi, LW   ⇒ rf2
         JAL, JALR  ⇒ R31
  we = Case opcode
         ALU, ALUi, LW ⇒ (ws ≠ R0)
         JAL, JALR     ⇒ on
         ...           ⇒ off

Cre:
  re1 = Case opcode
          ALU, ALUi, LW, SW, B__Z, JR, JALR ⇒ on
          J, JAL                            ⇒ off
  re2 = Case opcode
          ALU, SW ⇒ on
          ...     ⇒ off

Cstall:
  stall = ((rf1D = wsE)·weE + (rf1D = wsM)·weM + (rf1D = wsW)·weW) · re1D
        + ((rf2D = wsE)·weE + (rf2D = wsM)·weM + (rf2D = wsW)·weW) · re2D

Cbypass:
  ASrc = (rf1D = wsE)·weE·re1D

Is this correct?
Usefulness of a Bypass

[Datapath diagram: the same ALU-output-to-ALU-input bypass]

Consider:
  (I1) r1 ← M[r0 + 10]        vs.     (I1) r1 ← r0 + 10
  (I2) r4 ← r1 + 17                   (I2) r4 ← r1 + 17

and:
  JAL 500
  r4 ← r31 + 17

Where can this bypass help?
Bypass and Stall Signals

[Datapath diagram: Cdest logic in each stage; the execute-stage we is
split into we-bypass and we-stall]

weE has to be split into two components:

  we-bypassE = ((opcodeE = ALU) + (opcodeE = ALUi)) · (wsE ≠ R0)
  we-stallE  = (opcodeE = LW)·(wsE ≠ R0)
             + (opcodeE = JAL) + (opcodeE = JALR)

  ASrc  = (rf1D = wsE)·we-bypassE·re1D
  Stall = ((rf1D = wsE)·we-stallE + (rf1D = wsM)·weM + (rf1D = wsW)·weW)·re1D
        + ((rf2D = wsE)·weE + (rf2D = wsM)·weM + (rf2D = wsW)·weW)·re2D
Fully Bypassed Datapath

[Datapath diagram: bypass muxes ASrc and BSrc select among the execute,
memory, and write-back stage results; an RA path handles links]

Is there still a need for the Stall signal?

  Stall = (rf1D = wsE)·(opcodeE = LW)·(wsE ≠ R0)·re1D
        + (rf2D = wsE)·(opcodeE = LW)·(wsE ≠ R0)·re2D
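With full bypassing, only the load-use case remains. A tiny C sketch, reusing DecodeSrc from the earlier stall sketch (OPC_LW is an assumed opcode constant):

  enum { OPC_LW = 2 };   /* assumed encoding */

  /* Stall only when a load in EX produces a register that decode reads. */
  bool load_use_stall(DecodeSrc d, int opcodeE, int wsE) {
      bool lw_writes = (opcodeE == OPC_LW) && (wsE != 0);  /* wsE != R0 */
      return lw_writes && ((d.rf1 == wsE && d.re1) ||
                           (d.rf2 == wsE && d.re2));
  }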
Why an Instruction may not
be dispatched every cycle
(CPI>1)
Full bypassing may be too expensive to implement
typically all frequently used paths provided
some infrequently used bypass paths may increase cycle
time and/or cost more than their benefit in reducing CPI
Loads have two cycle latency
Instruction after load cannot use load result
MIPS-I ISA defined load delay slots, a software-visible
pipeline hazard (compiler schedules independent instruction
or inserts NOP to avoid hazard). Removed in MIPS-II.
Conditional branches may cause bubbles
kill following instruction(s) if no delay slots
(Machines with software-visible delay slots may execute
significant number of NOP instructions inserted by compiler code
scheduling)
Multilevel Memories
(Improving performance using a
little cash)
CPU-Memory Bottleneck
CPU
Memory
Performance of high-speed computers is usually
limited by memory bandwidth & latency
Latency (time for a single access)
Memory access time >> Processor cycle time
Bandwidth (number of accesses per unit time)
if a fraction m of instructions access memory
  ⇒ 1+m memory references / instruction
  ⇒ CPI = 1 requires 1+m memory refs / cycle
Processor-DRAM Gap (latency)

[Figure: log-scale performance vs. year, 1980-2000: processor performance
grows ~60%/yr (Moore's Law) while DRAM latency improves only ~7%/yr; the
processor-memory performance gap grows ~50%/year. Courtesy of David
Patterson, UC Berkeley; used with permission]
Four-issue superscalar executes 400 instructions during cache miss!
Multilevel Memory
Strategy: Hide latency using small, fast
memories called caches.
Caches are a mechanism to hide memory
latency based on the empirical observation
that the stream of memory references made
by a processor exhibits locality
  loop: ADD  r2, r1, r1
        SUBI r3, r3, #1
        BNEZ r3, loop

What is the pattern of instruction memory addresses
(PC = 96, 100, 104, 108, 112, ...)?
Typical Memory Reference Patterns

[Figure: address vs. time; instruction fetches form linear sequences with
repeated loops (n loop iterations), stack accesses grow and shrink
linearly, and data accesses form scattered but clustered points]
Locality Properties of Patterns
Two locality properties of memory references:
If a location is referenced it is likely to be
referenced again in the near future.
This is called temporal locality.
If a location is referenced it is likely that
locations near it will be referenced in the near
future.
This is called spatial locality.
Caches
Caches exploit both properties of patterns.
Exploit temporal locality by remembering the
contents of recently accessed locations.
Exploit spatial locality by remembering blocks
of contents of recently accessed locations.
Memory Hierarchy

[Figure: CPU connected at bandwidth A to a small, fast memory (RF, SRAM),
which is backed at bandwidth B by a big, slow memory (DRAM); the fast
memory holds frequently used data]

  size:      Register << SRAM << DRAM   why? Due to cost
  latency:   Register << SRAM << DRAM   why? Due to the size of DRAM
  bandwidth: on-chip >> off-chip        why? Due to cost and wire delays
                                        (on-chip wires cost much less, and
                                        are faster)

On a data access:
  hit  (data ∈ fast memory) ⇒ low latency access
  miss (data ∉ fast memory) ⇒ long latency access (DRAM)

The fast memory is effective only if the bandwidth requirement at B is << A.
Management of Memory Hierarchy
Software managed, e.g., registers
part of the software-visible processor state
software in complete control of storage allocation
but hardware might do things behind softwares back,
e.g., register renaming
Hardware managed, e.g., caches
not part of the software-visible processor state
hardware automatically decides what is kept in
fast memory
but software may provide hints, e.g., don't-cache or
prefetch
A Typical Memory Hierarchy c.2000

[Figure: a CPU with a multiported register file (part of the CPU); split
L1 instruction and data caches (on-chip SRAM); a large unified L2 cache
(on-chip or off-chip SRAM); and multiple interleaved memory banks (DRAM)]
Inside a Cache

[Figure: the cache sits between the processor and main memory, exchanging
addresses and data with both; each cache line holds an address tag plus a
data block that is a copy of a range of main-memory locations, e.g., tag
100 paired with copies of memory locations 100, 101, ...]
Cache Algorithm (Read)

Look at the processor address, and search the cache tags to find a match.
Then either:

  HIT:  return the copy of the data from the cache.
  MISS: read the block of data from main memory, wait, then return the
        data to the processor and update the cache.
        (Which line do we replace?)
Placement Policy

[Figure: a 32-block memory (blocks 0-31) mapping into an 8-block cache]

Where can memory block 12 be placed?
  Fully Associative:        anywhere
  (2-way) Set Associative:  anywhere in set 0   (12 mod 4)
  Direct Mapped:            only into block 4   (12 mod 8)
Direct-Mapped Cache

[Figure: the address is split into Tag (t bits), Index (k bits), and Block
Offset (b bits); the index selects one of 2^k lines, each holding a valid
bit, a tag, and a data block; the stored tag is compared (=) with the
address tag to produce HIT, and the offset selects the data word or byte]
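A minimal C sketch of this lookup (sizes are my own assumptions: 256 lines, 32-byte blocks; not from the lecture):

  #include <stdint.h>
  #include <stdbool.h>

  #define NLINES 256                  /* 2^k lines (assumed)  */
  #define BLOCK  32                   /* 2^b bytes per block  */

  typedef struct {
      bool     valid;
      uint32_t tag;
      uint8_t  data[BLOCK];
  } Line;

  static Line cache[NLINES];

  /* Split the address into tag/index/offset and probe one line. */
  bool dm_lookup(uint32_t addr, uint8_t *out) {
      uint32_t offset = addr % BLOCK;
      uint32_t index  = (addr / BLOCK) % NLINES;
      uint32_t tag    = addr / (BLOCK * NLINES);
      Line *l = &cache[index];
      if (l->valid && l->tag == tag) {    /* HIT */
          *out = l->data[offset];
          return true;
      }
      return false;                       /* MISS: refill from memory */
  }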
Fully Associative Cache

[Figure: the address is split into Tag and Block Offset only (no index);
every line's tag is compared in parallel with the address tag, and any
match produces HIT and selects that line's data word or byte]
Direct Map Address Selection: higher-order vs. lower-order address bits

[Figure: the block address (tag + index) plus offset; the index can be
taken from the lower-order or the higher-order bits of the block address,
with the remainder forming the tag that is compared for valid? and hit?]

Should selection be based on the lower-order or the higher-order
address bits?
2-Way Set-Associative Cache

[Figure: the index (k bits) selects one set, i.e., one line in each of two
ways; both stored tags are compared with the address tag (t bits) in
parallel, the matching way's data block supplies the word or byte, and HIT
is asserted if either way matches]
Highly-Associative Caches

For high associativity, use a content-addressable memory (CAM) for the
tags. Overhead?

[Figure: the address is split into tag, set index, and offset; within the
selected set (set 0, set 1, ..., set i), every tag is compared (=?) in
parallel to select the data block]

Only one set is enabled, and only the hit data is accessed.
The advantage is low power, because only the hit data is accessed.
Average Read Latency using a Cache

α is the HIT RATIO: the fraction of references found in the cache.
1 - α is the MISS RATIO: the remaining references.

Average access time for serial search (probe the cache first, then main
memory on a miss):
  tavg = α·tc + (1 - α)·(tc + tm) = tc + (1 - α)·tm

Average access time for parallel search (probe the cache and main memory
together):
  tavg = α·tc + (1 - α)·tm

tc is smallest for which type of cache?
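A quick illustrative calculation (numbers assumed): with α = 0.95, tc = 1 cycle, and tm = 20 cycles, the serial organization averages 1 + 0.05 × 20 = 2 cycles, while the parallel organization averages 0.95 × 1 + 0.05 × 20 = 1.95 cycles.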
Write Policy
Cache hit:
write through: write both cache & memory
- generally higher traffic but simplifies cache coherence
write back: write cache only (memory is written
only when the entry is evicted)
- a dirty bit per block can further reduce the traffic
Cache miss:
no write allocate: only write to main memory
write allocate: (aka fetch on write)
fetch block into cache
Common combinations:
write through and no write allocate
write back with write allocate
Write Performance

[Figure: the same direct-mapped organization as before, but the data RAM
write-enable (WE) depends on the tag-match HIT signal, so the tag check
must complete before the data can be written]
Replacement Policy
In an associative cache, which block from a set should
be evicted when the set becomes full?
Random
Least Recently Used (LRU)
LRU cache state must be updated on every access
true implementation only feasible for small sets (2-way easy)
pseudo-LRU binary tree often used for 4-8 way
First In, First Out (FIFO) a.k.a. Round-Robin
used in highly associative caches
This is a second-order effect. Why?
Causes for Cache Misses
Compulsory: first-reference to a block a.k.a. cold
start misses
- misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed
by the program
- misses that would occur even under perfect
placement & replacement policy
Conflict:
misses that occur because of collisions
due to block-placement strategy
- misses that would not occur with full associativity
Determining the type of a miss requires running
program traces on a cache simulator
Improving Cache Performance
Average memory access time =
Hit time + Miss rate x Miss penalty
To improve performance:
reduce the miss rate (e.g., larger cache)
reduce the miss penalty (e.g., L2 cache)
reduce the hit time
Aside: what is the simplest design strategy? Design the largest primary
cache without slowing down the clock or adding pipeline stages.
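A hedged worked example of the formula above (numbers assumed, not from the lecture), applied recursively to account for an L2: AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_L2). With a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate, and a 100-cycle memory access: AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 1 + 0.05 × 30 = 2.5 cycles.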
Block Size and Spatial Locality

A block is the unit of transfer between the cache and memory.

[Figure: the CPU address splits into a block address (tag, 32-b bits) and
an offset (b bits); 2^b = block size a.k.a. line size; the example shows a
4-word block (Word0..Word3) with b = 2]
Larger block size has distinct hardware advantages
less tag overhead
exploit fast burst transfers from DRAM
exploit fast burst transfers over wide busses
What are the disadvantages of increasing block size?
A larger block size reduces compulsory misses (the first miss to a block),
but larger blocks may increase conflict misses since the number of blocks
is smaller.
Victim Caches (HP 7200)

[Figure: a small fully-associative victim cache (4 blocks) sits beside the
L1 data cache: data replaced from L1 goes into the victim cache; on an L1
miss that hits in the victim cache, the data is returned from the victim
cache; data replaced from the victim cache goes... where?]

Works very well with a direct-mapped cache.
Pseudo-Associative Caches (MIPS R1000)

Look at the processor address, and look in the indexed set for the data.
Then either:

  HIT:  return the copy of the data from the cache.
  MISS: invert the MSB of the index field and look in the pseudo set:
          SLOW HIT: return the data.
          MISS:     read the block of data from the next level of cache.
Reducing (Read) Miss Penalty

[Figure: a write buffer sits between the L1 data cache and the unified L2,
holding replaced dirty data from L1 (writeback cache) or all writes
(writethrough cache)]

The write buffer may hold the updated value of a location needed by a
read miss.
  - On a read miss, the simple scheme is to wait for the write buffer to
    go empty. This doesn't help much: designers of the MIPS M/1000
    estimated that waiting for a four-word buffer to empty increased the
    read miss penalty by a factor of 1.5.
  - Better: check the write buffer on a read miss; if there are no
    conflicts, allow the read miss to continue (else, return the value in
    the write buffer).
Block-level Optimizations

Tags are too large, i.e., too much overhead. Simple solution: larger
blocks, but the miss penalty could be large.

Sub-block placement:
  - a valid bit is added to units smaller than the full block, called
    sub-blocks
  - only read a sub-block on a miss
  - if a tag matches, is the word in the cache?

  Tag   Sub-block valid bits
  100   1 1 0 1
  300   1 1 1 0
  204   0 1 0 1
Write Alternatives

- Writes take two cycles in the memory stage: one cycle for the tag check
  plus one cycle for the data write if hit.
- Design a data RAM that can perform a read and a write in one cycle, and
  restore the old value after a tag miss.
- Hold write data for a store in a single buffer ahead of the cache, and
  write the cache data during the next store's tag check.
    - Need to bypass from the write buffer if a read matches the
      write-buffer tag.
Pipelining Cache Writes (Alpha 21064)

[Figure: the direct-mapped write path from before, extended with a
delayed-write buffer and a bypass so the data write of one store overlaps
the tag check of the next]
Effect of Cache Parameters on Perf.
Larger cache size
+ reduces capacity and conflict misses
- hit time may increase
Larger block size
+ spatial locality reduces compulsory misses and
capacity reload misses
- fewer blocks may increase conflict miss rate
- larger blocks may increase miss penalty
Higher associativity
+ reduces conflict misses (up to around 4-8 way)
- may increase access time
See page 427 of text for a nice summary
Cache (Memory) Performance Optimization

Improving Cache Performance (recap):
Average memory access time = Hit time + Miss rate × Miss penalty
To improve performance: reduce the miss rate (e.g., larger cache), reduce
the miss penalty (e.g., L2 cache), or reduce the hit time. The simplest
design strategy is to design the largest primary cache without slowing
down the clock or adding pipeline stages.
RAM-tag 2-way Set-Associative Cache

[Figure: the index addresses both ways' tag and data RAMs in parallel; the
two stored tags are compared with the address tag, and the comparison
result drives a MUX that selects the data word from the matching way of
the set]
CAM-tag 2-way Set-Associative Cache

[Figure: CAM tags in each way match the address tag directly; the match
lines demux into word-line enables, so only the matching way's data RAM is
read]
Block-level Optimizations (recap): the main reason for sub-block placement
is to reduce tag overhead.
Pipelining Cache Writes (Alpha 21064), continued

Writes usually take longer than reads because the tags have to be checked
before writing the data. First, the tags and data are split, so they can
be addressed independently.
Prefetching
Speculate on future instruction and data
accesses and fetch them into cache(s)
Instruction accesses easier to predict than data
accesses
Varieties of prefetching
Hardware prefetching
Software prefetching
Mixed schemes
What types of misses does prefetching affect?
Prefetching reduces compulsory misses, and can increase conflict and
capacity misses.
Issues in Prefetching
Usefulness should produce hits
Timeliness not late and not too early
Cache and bandwidth pollution
[Figure: prefetched data flows into the L1 instruction cache, L1 data
cache, and unified L2, where it can pollute the caches and consume bus
bandwidth]
Hardware Prefetching of Instructions
Instruction prefetch in Alpha AXP 21064
Fetch two blocks on a miss; the requested
block and the next consecutive block
Requested block placed in cache, and next
block in instruction stream buffer
[Figure: on a miss, the requested block goes from the unified L2 into the
L1 instruction cache, and the next consecutive block goes into a 4-block
stream buffer beside it]

Need to check the stream buffer if the requested block is in there.
Never more than one 32-byte block in the stream buffer.
Hardware Data Prefetching
One Block Lookahead (OBL) scheme
Initiate prefetch for block b + 1 when block b is
accessed
Why is this different from doubling block size?
Prefetch-on-miss:
Prefetch b + 1 upon miss on b
Tagged prefetch:
Tag bit for each memory block
Tag bit = 0 signifies that a block is demand
fetched or if a prefetched block is referenced for
the first time
Prefetch for b + 1 initiated only if tag bit = 0 on b
The HP PA 7200 uses OBL prefetching. Tagged prefetching is twice as
effective as prefetch-on-miss in reducing miss rates.
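A minimal C sketch of the tagged-prefetch decision (the data structures are my own assumptions; real hardware keeps the tag bit with each cache block):

  #include <stdbool.h>

  #define NBLOCKS 4096
  static bool tag_bit[NBLOCKS];   /* 1 = prefetched, not yet referenced */

  void prefetch(int b);           /* assumed: starts fetching block b */

  /* Called when block b misses and is demand-fetched. */
  void on_demand_fetch(int b) {
      tag_bit[b] = false;                          /* tag = 0 */
      if (b + 1 < NBLOCKS) { prefetch(b + 1); tag_bit[b + 1] = true; }
  }

  /* Called when block b hits in the cache. */
  void on_hit(int b) {
      if (tag_bit[b]) {            /* first touch of a prefetched block */
          tag_bit[b] = false;
          if (b + 1 < NBLOCKS) { prefetch(b + 1); tag_bit[b + 1] = true; }
      }
  }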
Comparison of Variant OBL Schemes

[Figure: accessing contiguous blocks. With prefetch-on-miss, a prefetch is
issued only on a miss, so demand-fetched and prefetched blocks alternate
(a miss on every other block). With tagged prefetch, each first touch of a
prefetched block (tag bit 1 → 0) triggers the next prefetch, so after the
initial demand fetches the prefetcher stays ahead and all subsequent
blocks are prefetched.]
Software Prefetching

  for (i = 0; i < N; i++) {
      fetch(&a[i+1]);
      fetch(&b[i+1]);
      SUM = SUM + a[i] * b[i];
  }

What property do we require of the cache for prefetching to work?
The cache should be non-blocking, or lockup-free: the processor can
proceed while the prefetched data is being fetched, and the cache
continues to supply instructions and data while waiting for the prefetched
data to return.
Software Prefetching Issues

Timing is the biggest issue, not predictability:
  - if you prefetch very close to when the data is required, you might
    be too late
  - prefetch too early, and you cause pollution
  - estimate how long it will take for the data to come into L1, so we
    can set P appropriately. Why is this hard to do?

  for (i = 0; i < N; i++) {
      fetch(&a[i+P]);
      fetch(&b[i+P]);
      SUM = SUM + a[i] * b[i];
  }

Don't ignore the cost of the prefetch instructions.
Compiler Optimizations

Restructuring code affects the data block access sequence:
  - group data accesses together to improve spatial locality
  - re-order data accesses to improve temporal locality

Prevent data from entering the cache:
  - useful for variables that are only accessed once
  - kill data that will never be used
  - streaming data exploits spatial locality but not temporal locality

Miss-rate reduction without any hardware changes: the hardware designer's
favorite solution.
Loop Interchange

  for (j = 0; j < N; j++) {
      for (i = 0; i < M; i++) {
          x[i][j] = 2 * x[i][j];
      }
  }
      ⇓
  for (i = 0; i < M; i++) {
      for (j = 0; j < N; j++) {
          x[i][j] = 2 * x[i][j];
      }
  }

What type of locality does this improve? (Spatial locality: C arrays are
row-major, so the interchanged loops walk x[i][j] sequentially in memory.)
Loop Fusion

  for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
          a[i][j] = b[i][j] * c[i][j];
  for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
          d[i][j] = a[i][j] * c[i][j];
      ⇓
  for (i = 0; i < N; i++)
      for (j = 0; j < M; j++) {
          a[i][j] = b[i][j] * c[i][j];
          d[i][j] = a[i][j] * c[i][j];
      }

What type of locality does this improve? (Temporal locality: each a[i][j]
and c[i][j] is reused immediately instead of a whole array-sweep later.)
Blocking

  for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
          r = 0;
          for (k = 0; k < N; k++)
              r = r + y[i][k] * z[k][j];
          x[i][j] = r;
      }

[Figure: access-pattern snapshot of x, y, and z, shading elements not
touched, old accesses, and new accesses]
Blocking

  for (jj = 0; jj < N; jj = jj + B)
      for (kk = 0; kk < N; kk = kk + B)
          for (i = 0; i < N; i++)
              for (j = jj; j < min(jj+B, N); j++) {
                  r = 0;
                  for (k = kk; k < min(kk+B, N); k++)
                      r = r + y[i][k] * z[k][j];
                  x[i][j] = x[i][j] + r;
              }

What type of locality does this improve? y benefits from spatial locality;
z benefits from temporal locality.
CPU-Cache Interaction (5-stage pipeline)

[Datapath diagram: a primary instruction cache replaces the magic
instruction memory and a primary data cache replaces the data memory; each
produces a hit? signal; cache refill data arrives from the lower levels of
the memory hierarchy via the memory control; PCen freezes the PC]

Stall the entire CPU on a data cache miss.
What about an instruction miss?
Memory (DRAM) Performance

A 1 Gb DRAM has a 50-100 ns access time and needs refreshing.

Upon a cache miss:
  - 4 clocks to send the address
  - 24 clocks for the access time per word
  - 4 clocks to send a word of data

Latency worsens with increasing block size: a 4-word block needs 128
clocks from a dumb (one-word-wide, non-interleaved) memory, or 116 if the
address is sent only once.
Memory organizations

[Figure: three organizations: (1) CPU - cache - bus - memory, all one word
wide; (2) wide memory: CPU and cache connected over a wide bus to a wide
memory; (3) interleaved memory: a one-word-wide bus to multiple memory
banks (bank 0, bank 1, ...)]

The Alpha AXP 21064 used a 256-bit-wide memory and cache.
Interleaved Memory

Banks are often 1 word wide. Send an address to all the banks; bank i
contains all words whose address modulo N is i.

How long to get 4 words back? 4 + 24 + 4 × 4 = 44 clocks from interleaved
memory: the four accesses overlap, but the words return one at a time over
the bus.
Independent Memory

With a non-blocking cache and an independent bus per bank, send an address
to each bank; bank i contains all words whose address modulo N is i.

How long to get 4 words back? 4 + 24 + 4 = 32 clocks from main memory for
4 words: the four transfers complete in parallel.
Memory Bank Conflicts

Consider a 128-bank memory in the NEC SX/3, where each bank can service
independent requests:

  int x[256][512];
  for (j = 0; j < 512; j++)
      for (i = 0; i < 256; i++)
          x[i][j] = 2 * x[i][j];

Consider column k: x[i][k], 0 <= i < 256. The address of an element in
the column is i*512 + k. Where do these addresses go?
Bank Assignment

  Bank number         = Address mod NumBanks
  Address within bank = Address / NumBanks

[Figure: four banks B0-B3, each holding addresses 0, 1, 2, 3 within the
bank]

Should randomize the address-to-bank mapping. Could use an odd/prime
number of banks. Problem? The bank number for i*512 + k should depend on
i; we can pick more significant address bits.
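A quick C check of where the column addresses land (a sketch of the modulo mapping above, using the 128-bank SX/3 example):

  #include <stdio.h>

  #define NBANKS 128

  int main(void) {
      int k = 3;                         /* any column index */
      /* The address of x[i][k] in words is i*512 + k.  Since
         512 % 128 == 0, every element of the column maps to the
         same bank: bank = k % 128 for all i. */
      for (int i = 0; i < 4; i++)
          printf("x[%d][%d] -> bank %d\n", i, k, (i * 512 + k) % NBANKS);
      return 0;   /* prints bank 3 four times: the column serializes on one bank */
  }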
Virtual Memory Basics
Memory Management
The Fifties:
- Absolute Addresses
- Dynamic address translation
The Sixties:
- Paged memory systems and TLBs
- Atlas Demand paging
Modern Virtual Memory Systems
Types of Names for Memory Locations

[Figure: machine language address → (ISA address mapping) → virtual
address → (OS mapping) → physical address → Physical Memory (DRAM)]

  Machine language address: as specified in machine code.
  Virtual address: the ISA specifies translation of a machine code address
    into the virtual address of a program variable (the translation may
    involve a segment register).
  Physical address: the operating system specifies the mapping of a
    virtual address into the name for a physical memory location (the
    actual address signals going to the DRAM chips).
Absolute Addresses
EDSAC, early 50s
effective address = physical memory address
Only one program ran at a time, with unrestricted access
to entire machine (RAM + I/O devices)
Addresses in a program depended upon where
the program was to be loaded in memory
But it was more convenient for programmers to
write location-independent subroutines
How could location independence be achieved?
This led to the development of loaders and linkers to statically relocate
and link programs.
Dynamic Address Translation

Motivation: in the early machines, I/O operations were slow and each word
transferred involved the CPU. Higher throughput was possible if the CPU
and I/O of 2 or more programs were overlapped. How? Multiprogramming.

[Figure: prog1 and prog2 resident simultaneously in physical memory]

Location-independent programs: programming and storage management ease
  ⇒ need for a base register
Protection: independent programs should not affect each other
inadvertently
  ⇒ need for a bound register
Simple Base and Bound Translation

[Figure: the effective address (from the Effective Addr Register, e.g., on
a Load X) is compared against the Bound Register (a violation raises Bound
Violation) and added (+) to the Base Register to form the physical address
within the current segment in main memory]

Base and bound registers are only visible/accessible when the processor
is running in kernel (a.k.a. supervisor) mode.
Separate Areas for Program and Data

[Figure: separate data base/bound registers translate effective data
addresses into a data segment, and separate program base/bound registers
translate the program counter into a program segment, each with its own
bound-violation check]

What is an advantage of this separation? It permits sharing of program
segments. Used today on Cray vector supercomputers.
Memory Fragmentation

[Figure: OS space plus user 1 (16 K), user 2 (24 K), and user 3 (32 K); as
users 4 and 5 arrive and then users 2 and 5 leave, free holes of 8-24 K
open up between the allocated regions]

As users come and go, the storage is fragmented. Therefore, at some stage
programs have to be moved around to compact the storage (called "burping"
the memory).
Paged Memory Systems

A processor-generated address can be interpreted as a pair
<page number, offset>. A page table contains the physical address of the
base of each page.

[Figure: pages 0-3 of User-1's address space are scattered across physical
memory; the page table of User-1 maps each page number to its page frame]

What requirement does fixed-length pages plus indirection through the page
table relax? It relaxes the contiguous allocation requirement.
Private Address Space per User

[Figure: users 1, 2, and 3 each have their own page table mapping their
VA1 into physical memory, alongside OS pages and free pages]

Each user has a page table; the page table contains an entry for each
user page. The OS ensures that the page tables are disjoint.
Where Should Page Tables Reside?
Space required by the page tables is proportional to
the address space, number of users, ...
Space requirement is large
too expensive to keep in registers
Special registers just for the current user:
- What disadvantages does this have?
may not be feasible for large page tables
Main memory:
- needs one reference to retrieve the page base
address and another to access the data word
doubles number of memory references!
Keeping page tables in special registers also affects context-switching
overhead and needs new management instructions.
Page Tables in Physical Memory

[Figure: the page tables of users 1 and 2 themselves reside in physical
memory, alongside the users' pages that they map]
A Problem in Early Sixties
There were many applications whose data could not
fit in the main memory, e.g., Payroll
Paged memory system reduced fragmentation but
still required the whole program to be resident in
the main memory
Programmers moved the data back and forth from the
secondary store by overlaying it repeatedly on the
primary store
tricky programming!
Manual Overlays

Assuming an instruction can address all the storage on the drum:

  Method 1: the programmer keeps track of addresses in the main memory
    and initiates an I/O transfer when required.
    Problem: too difficult for users.
  Method 2: automatic initiation of I/O transfers by software address
    translation (Brooker's interpretive coding, 1960).
    Problem: too slow.

[Figure: Ferranti Mercury, 1956: 40k bits of main central store backed by
640k bits of drum]

(The British firm Ferranti built the Mercury and then the Atlas.)
Demand Paging in Atlas (1962)

"A page from secondary storage is brought into the primary storage
whenever it is (implicitly) demanded by the processor." (Tom Kilburn)

[Figure: central memory: primary, 32 pages of 512 words/page; secondary
(drum): 32 x 6 pages]

Primary memory acts as a cache for secondary memory: the user sees
32 x 6 x 512 words of storage (a "single-level store").
Hardware Organization of Atlas

[Figure: the effective address passes through initial address decode
against the PARs (Page Address Registers, one per page frame, each holding
<effective PN, status>); 48-bit words, 512-word pages]

  Fixed (ROM):  16 pages, 0.4-1 µsec, system code (not swapped)
  Subsidiary:    2 pages, 1.4 µsec, system data (not swapped)
  Main:         32 pages, 1.4 µsec, 1 PAR per page frame
  Drum (4):    192 pages
  Tape:          8 decks, 88 µsec/word

Compare the effective page address against all 32 PARs:
  match    ⇒ normal access
  no match ⇒ page fault: the state of the partially executed instruction
             is saved
Atlas Demand Paging Scheme
On a page fault:
input transfer into a free page is initiated
the Page Address Register (PAR) is updated
if no free page is left, a page is selected
to be replaced (based on usage)
the replaced page is written on the drum
- to minimize drum latency effect, the first
empty page on the drum was selected
the page table is updated to point to the new
location of the page on the drum
This was handled by the Supervisor program, which clearly foreshadowed the
operating system.
Caching vs. Demand Paging

[Figure: CPU - cache - primary memory, vs. CPU - primary memory -
secondary memory]

                  Caching                   Demand paging
  unit            cache entry               page-frame
  block size      cache block (~32 bytes)   page (~4K bytes)
  miss rate       cache miss (1% to 20%)    page miss (<0.001%)
  hit time        cache hit (~1 cycle)      page hit (~100 cycles)
  miss penalty    cache miss (~10 cycles)   page miss (~5M cycles)
  miss handling   in hardware               mostly in software
Modern Virtual Memory Systems: the illusion of a large, private, uniform
store

Protection & Privacy: several users, each with their private address space
and one or more shared address spaces (page table ≡ name space).

Demand Paging: the ability to run a program larger than the primary
memory.

[Figure: each user's pages live partly in primary memory and partly in the
swapping store, under OS control; a TLB caches the VA → PA mapping]

What is another big benefit? Portability on machines with different memory
configurations. The price is address translation on each memory reference.
Address Translation and Protection

[Figure: the virtual address (Virtual Page No. (VPN) + offset), together
with Kernel/User mode and Read/Write intent, goes through a protection
check and address translation, possibly raising an exception, and yields
the physical address (Physical Page No. (PPN) + offset)]

Every instruction and data access needs address translation and protection
checks. A good VM design needs to be fast (~ one cycle) and space
efficient.
Linear Page Table

[Figure: the VPN field of the virtual address indexes a linear page table
(located by the PT Base Register); the selected PTE plus the offset forms
the address of the data word in one of the data pages]

A Page Table Entry (PTE) contains:
  - the PPN (physical page number) of a memory-resident page,
  - the DPN (disk page number) of a page swapped to disk, or
  - a marker for a non-existent page,
  plus status bits for protection and usage.

The OS changes the page table base register to point to the base of the
page table for the active user process.
Size of Linear Page Table

With 32-bit addresses, 4-KB pages, and 4-byte PTEs:
  - 2^20 PTEs, i.e., a 4 MB page table per user
  - 4 GB of swap needed to back up the full virtual address space

Larger pages?
  - more internal fragmentation (don't use all the memory in a page)
  - larger page fault penalty (more time to read from disk)

What about a 64-bit virtual address space??? Even 1 MB pages would require
2^44 8-byte PTEs (128 TB!).

What is the saving grace? The virtual address space is large, but only a
small fraction of the pages are populated, so we can use a sparse
representation of the table.
Hierarchical Page Table

[Figure: the 32-bit virtual address is split into a 10-bit L1 index p1
(bits 31:22), a 10-bit L2 index p2 (bits 21:12), and an offset
(bits 11:0); a processor register holds the root of the current page
table; p1 indexes the level-1 page table, whose entry points to a level-2
page table; p2 indexes that table to reach the data page. Pages (and
page-table pages) may be in primary memory, in secondary memory, or the
PTE may mark a non-existent page.]
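A minimal C sketch of this two-level walk (the PTE layout, VALID bit, and memory interface are my own assumptions; a real walker must also handle swapped-out and non-existent pages):

  #include <stdint.h>

  #define VALID 0x1u                          /* assumed PTE status bit */

  /* pt_mem: word-addressed view of the memory holding the tables;
     root:   word index of the level-1 table.                         */
  uint32_t translate(uint32_t root, uint32_t va, const uint32_t *pt_mem) {
      uint32_t p1     = (va >> 22) & 0x3ff;   /* 10-bit L1 index */
      uint32_t p2     = (va >> 12) & 0x3ff;   /* 10-bit L2 index */
      uint32_t offset = va & 0xfff;           /* 12-bit offset   */

      uint32_t l1e = pt_mem[root + p1];       /* points to an L2 table   */
      if (!(l1e & VALID)) return 0;           /* page fault (simplified) */

      uint32_t l2e = pt_mem[((l1e >> 12) << 10) + p2];  /* 1024 words/page */
      if (!(l2e & VALID)) return 0;           /* page fault (simplified) */

      return ((l2e >> 12) << 12) | offset;    /* PPN ++ offset */
  }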
Translation Lookaside Buffers

Address translation is very expensive! In a two-level page table, each
reference becomes several memory references: the best case is 3 memory
references (two page-table accesses plus the data access); the worst case
adds 2 page faults (disk accesses) or more.

Solution: cache translations in a TLB.
  TLB hit  ⇒ single-cycle translation
  TLB miss ⇒ page-table walk to refill

[Figure: the VPN of the virtual address is matched against the TLB tags;
each entry holds V, R, W, D status bits, a tag, and a PPN; on a hit the
PPN is concatenated with the offset to form the physical address]
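A tiny C sketch of a fully associative TLB lookup (the entry format is my own assumption):

  #include <stdint.h>
  #include <stdbool.h>

  #define TLB_ENTRIES 64

  typedef struct {
      bool     valid;
      uint32_t vpn;    /* virtual page number (the tag) */
      uint32_t ppn;    /* physical page number          */
  } TlbEntry;

  static TlbEntry tlb[TLB_ENTRIES];

  /* Fully associative search: compare the VPN against every entry.
     Returns true on a hit; a miss would trigger a page-table walk. */
  bool tlb_lookup(uint32_t va, uint32_t *pa) {
      uint32_t vpn = va >> 12, offset = va & 0xfff;   /* 4 KB pages */
      for (int i = 0; i < TLB_ENTRIES; i++)
          if (tlb[i].valid && tlb[i].vpn == vpn) {
              *pa = (tlb[i].ppn << 12) | offset;
              return true;
          }
      return false;
  }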
TLB Designs

- Typically 32-128 entries; usually fully associative
    - Each entry maps a large page, hence less spatial locality across
      pages ⇒ more likely that two entries conflict
    - Sometimes larger TLBs are 4-8 way set-associative
- Random or FIFO replacement policy
- Typically only one page mapping per entry
- No process information in the TLB

TLB Reach: the size of the largest virtual address space that can be
simultaneously mapped by the TLB.
  Example: 64 TLB entries, 4 KB pages, one page per entry.
  TLB Reach = 64 × 4 KB = 256 KB
Handling A TLB Miss
Software (MIPS, Alpha)
TLB miss causes an exception and the operating
system walks the page tables and reloads TLB
privileged untranslated addressing mode used
for walk
Hardware (SPARC v8, x86, PowerPC)
A memory management unit (MMU) walks the
page tables and reloads the TLB
If a missing (data or PT) page is encountered
during the TLB reloading, MMU gives up and
signals a Page-Fault exception for the original
instruction
Hierarchical Page Table Walk: SPARC v8
Virtual Address: Index 1 (bits 31-24), Index 2 (23-18),
Index 3 (17-12), Offset (11-0)
[Figure: the Context Table Register points to the Context
 Table and the Context Register selects the root ptr. Index 1
 walks the L1 Table to a PTP, Index 2 the L2 Table to a PTP,
 Index 3 the L3 Table to the PTE; Physical Address (bits
 31-0) = PPN + Offset (11-0).]
MMU does this table walk in hardware on a TLB miss
Address Translation: putting it all together
[Flowchart: Virtual Address -> TLB Lookup (hardware).
 On a miss: Page Table Walk (hardware or software). If the
 page is in memory, Update TLB and retry; if not, Page Fault
 (software; the OS loads the page). Where?
 On a hit: Protection Check (hardware). If denied,
 Protection Fault (software; SEGFAULT); if permitted,
 Physical Address (to cache).]
Need to restart instruction.
Soft and hard page faults.
Virtual Memory:
Part Deux
Address Translation: putting it all together
[Flowchart: recap of the previous slide, adding the restart
 arc: after the OS loads the faulting page, the instruction
 is restarted and the translation retried.]
Address Translation in CPU Pipeline
[Figure: PC -> Inst. TLB -> Inst. Cache -> Decode -> ... ->
 Data TLB -> Data Cache. Either TLB access can raise a TLB
 miss, a page fault, or a protection violation.]
Need restartable exception on page fault/protection
violation for software handler
Either hardware or software TLB refill on TLB miss
Coping with the additional latency of a TLB:
slow down the clock
pipeline the TLB and cache access
virtual address caches
parallel TLB/cache access
Virtual Address Caches
Instead of
  [CPU --VA--> TLB --PA--> Physical Cache -> Primary Memory]
place the cache before the TLB (StrongARM):
  [CPU --VA--> Virtual Cache --> TLB --PA--> Primary Memory]
+ one-step process in case of a hit
- cache needs to be flushed on a context switch
unless address space identifiers (ASIDs) included in tags
- aliasing problems due to the sharing of pages
Aliasing in Virtual-Address Caches
[Figure: two virtual pages VA1 and VA2 map through the page
 table to the same physical address PA; the virtual cache
 holds a 1st copy of the data tagged VA1 and a 2nd copy
 tagged VA2.]
Virtual cache can have two
copies of same physical data.
Writes to one copy not visible to
reads of other!
Two virtual pages share
one physical page
General Solution: Disallow aliases to coexist in cache
Software (i.e., OS) solution for direct-mapped cache:
VAs of shared pages must agree in cache index
bits; this ensures all VAs accessing same PA will
conflict in direct-mapped cache (early SPARCs)
Example: two processes sharing the same file map the same
memory segment to different parts of their address spaces.
Concurrent Access to TLB & Cache
[Figure: VA = VPN + page offset (k bits). The low L bits of
 the offset form a virtual index into a direct-mapped cache
 of 2^L blocks of 2^b bytes each; in parallel the TLB maps
 VPN -> PPN, which is compared with the physical tag -> hit?]
Index L is available without consulting the TLB
cache and TLB accesses can begin simultaneously
Tag comparison is made after both accesses are completed
Cases: L + b = k, L + b < k, L + b > k
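A worked instance of the cases above: with 4-KB pages (k = 12)
and 32-byte blocks (b = 5), requiring L + b <= k limits the
virtual index to L <= 7, so a direct-mapped cache accessed this
way can be at most 2^(L+b) = 4 KB.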
Virtual-Index Physical-Tag Caches:
Associative Organization
[Figure: VA = VPN + page offset; a virtual index of L = k - b
 bits selects one set across 2^a direct-mapped banks of 2^L
 blocks each; the TLB supplies the PPN, which is compared
 with the 2^a physical tags -> hit? -> Data]
After the PPN is known, 2^a physical tags are compared
Is this scheme realistic?
Consider 4-KB pages and caches with 32-byte blocks:
  32-KB cache => 2^a = 8
  4-MB cache  => 2^a = 1024
No!
Concurrent Access to TLB & Large L1
[Figure: a large direct-mapped, physically-addressed L1
 whose virtual index is wider than the page offset can hold
 entries for both VA1 and VA2 whose tags carry the same PPN.]
Can VA1 and VA2 both map to PA ?
(Yes: if they differ only in the lower index bits and share
a physical page.)
Second Level Cache
[Figure: CPU (with RF) -> split L1 Instruction and L1 Data
 caches -> Unified L2 Cache -> Memory]
Usually a common L2 cache backs up both Instruction
and Data L1 caches
L2 is inclusive of both Instruction and Data caches
Anti-Aliasing Using L2: (MIPS R10000)
[Figure: a virtually-indexed, physically-tagged direct-mapped
 L1 (entries VA1/PPN and VA2/PPN) backed by a direct-mapped
 L2 whose tags carry the PPN plus the virtual-index bits a1
 of the corresponding L1 entry.]
Suppose VA1 and VA2 both map to PA and VA1 is already in L1
and L2 (VA1 != VA2). After VA2 is resolved to PA, a collision
will be detected in L2; VA1 will be purged from L1 and L2,
and VA2 will be loaded => no aliasing!
Virtually-Addressed L1:
Anti-Aliasing using L2
[Figure: a virtually-indexed, virtually-tagged L1 backed by
 a physically-indexed, physically-tagged L2; each L2 entry
 also records the virtual tag VA1 under which the line lives
 in L1.]
A physically-addressed L2 can also be used to avoid aliases
in the virtually-addressed L1 (L2 contains L1).
Page Fault Handler
When the referenced page is not in DRAM:
The missing page is located (or created)
It is brought in from disk, and page table is updated
(A different job may start running on the CPU while the first
job waits for the requested page to be read from disk)
If no free pages are left, a page is swapped out
Pseudo-LRU replacement policy
Since it takes a long time to transfer a page (msecs),
page faults are handled completely in software by the
operating system
Untranslated addressing mode is essential to allow kernel to
access page tables
(Note: pseudo-LRU means you can't afford to do the real
thing: essentially one bit indicates whether a page has been
recently used or not, and a round-robin scan finds the first
page with a value of 0.)
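A hedged sketch of the one-bit, round-robin pseudo-LRU described
in the note (the used-bit array and hand position are assumptions):

    def pick_victim(used, hand):
        """used[i]: recently-used bit per page frame."""
        n = len(used)
        while True:
            if not used[hand]:           # first frame with bit 0 is evicted
                return hand, (hand + 1) % n
            used[hand] = False           # clear bit, give a second chance
            hand = (hand + 1) % n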
Swapping a Page of a Page Table
A PTE in primary memory contains primary or secondary memory
addresses; a PTE in secondary memory contains only secondary
memory addresses
=> a page of a PT can be swapped out only if none of its
PTEs point to pages in primary memory
Why? __________________________________
What is the worst thing you can do with respect to storing
page tables? Storing on disk a page of a page table whose
entries point to physical memory.
Atlas Revisited
One PAR for each physical page
PARs contain the VPNs of the
pages resident in primary
memory
[Figure: an array of PARs, one per physical page, each
 mapping PPN -> VPN.]
Advantage: The size is
proportional to the size of the
primary memory
What is the disadvantage ?
Good old Atlas, which had all the ideas!
(It was an improvement on most of its successors.)
Hashed Page Table:
Approximating Associative Addressing
Virtual Address = VPN + Offset
[Figure: hash(VPN, PID) + Base of Table gives the PA of the
 PTE in a hashed page table held in primary memory; entries
 hold <VPN, PID, PPN> or <VPN, PID, DPN>.]
Hashed Page Table is typically 2 to 3
times larger than the number of PPNs
to reduce collision probability
It can also contain DPNs for some
non-resident pages (not common)
If a translation cannot be resolved in
this table then the software consults a
data structure that has an entry for
every existing page
Global System Address Space
[Figure: user address spaces map (Level A) into one Global
 System Address Space, which maps (Level B) onto Physical
 Memory.]
Level A maps users address spaces into the global
space providing privacy, protection, sharing etc.
Level B provides demand-paging for the large global
system address space
Level A and Level B translations may be kept in separate TLBs
Hashed Page Table Walk:
PowerPC Two-level, Segmented Addressing
[Figure: 64-bit user VA = Seg ID (bits 0-35) + Page (36-51)
 + Offset (52-63). hashS of the Seg ID, with a per-process
 segment table base, gives the PA of the entry in the Hashed
 Segment Table, yielding a Global Seg ID and an 80-bit
 system VA = Global Seg ID (0-51) + Page (52-67) + Offset
 (68-79). hashP, with the system-wide page table base, gives
 the PA of the PTE in the Hashed Page Table, yielding the
 PPN; 40-bit PA = PPN (0-27) + Offset (28-39).]
Power PC: Hashed Page Table
[Figure: 80-bit VA = VPN + Offset; hash(VPN) + Base of Table
 gives the PA of a slot in the hashed Page Table, held in
 primary memory; each slot holds <VPN, PPN> pairs.]
Each hash table slot has 8 PTEs <VPN,PPN> that are searched
sequentially
If the first hash slot fails, an alternate hash
function is used to look in another slot
All these steps are done in hardware!
Hashed Table is typically 2 to 3 times larger
than the number of physical pages
The full backup Page Table is a software
data structure
Interrupts:
altering the normal flow of control
[Figure: the program's instructions I_{i-1}, I_i, I_{i+1}
 are suspended while the interrupt handler's instructions
 HI_1, HI_2, ..., HI_n run, then control returns.]
An external or internal event that needs to be processed by
another (system) program. The event is usually unexpected or
rare from the program's point of view.
Causes of Interrupts
An interrupt is an event that requests the attention of
the processor
Asynchronous: an external event
input/output device service-request
timer expiration
power disruptions, hardware failure
Synchronous: an internal event (a.k.a. exceptions)
undefined opcode, privileged instruction
arithmetic overflow, FPU exception
misaligned memory access
virtual memory exceptions: page faults,
TLB misses, protection violations
traps: system calls, e.g., jumps into kernel
Asynchronous Interrupts:
invoking the interrupt handler
An I/O device requests attention by asserting one of
the prioritized interrupt request lines
When the processor decides to process the interrupt
it stops the current program at instruction Ii ,
completing all the instructions up to Ii-1
(precise interrupt)
it saves the PC of instruction Ii in a special
register (EPC)
disables interrupts and transfers control to a
designated interrupt handler running in kernel
mode
Interrupt Handler
To allow nested interrupts, EPC is saved before
enabling interrupts
need an instruction to move EPC into GPRs
need a way to mask further interrupts at least
until EPC can be saved
A status register indicates the cause of the interrupt; it must be visible to an interrupt handler
The return from an interrupt handler is a simple
indirect jump but usually involves
enabling interrupts
restoring the processor to the user mode
restoring hardware status and control state
a special return-from-exception instruction (RFE)
Synchronous Interrupts
A synchronous interrupt (exception) is caused by
a particular instruction
In general, the instruction cannot be completed
and needs to be restarted after the exception has
been handled
requires undoing the effect of one or more
partially executed instructions
In case of a trap (system call), the instruction is
considered to have been completed
a special jump instruction involving a
change to privileged kernel mode
Exception Handling (5-Stage Pipeline)
[Figure: 5-stage pipeline PC -> Inst. Mem -> Decode -> ... ->
 Data Mem, annotated with exception sources: PC address
 exceptions, illegal opcode, overflow, data address
 exceptions, and asynchronous interrupts.]
How to handle multiple simultaneous exceptions in
different pipeline stages?
How and where to handle external asynchronous
interrupts?
Exception Handling (5-Stage Pipeline)
[Figure: the same 5-stage pipeline with a commit point at
 the M stage. Each stage latches an exception flag (Exc D,
 Exc E, Exc M) and PC (PC D, PC E, PC M); kill signals (Kill
 F/D/E Stage, Kill Writeback) squash younger instructions;
 Select Handler PC redirects fetch; Cause and EPC registers
 record the exception; asynchronous interrupts enter at the
 commit point.]
Exception Handling (5-Stage Pipeline)
Hold exception flags in pipeline until commit point
(M stage)
Exceptions in earlier pipe stages override later
exceptions for a given instruction
Inject external interrupts at commit point (override
others)
If exception at commit: update Cause and EPC
registers, kill all stages, inject handler PC into fetch
stage
Virtual Memory Use Today
Desktops/servers have full demand-paged virtual memory
Portability between machines with different memory sizes
Protection between multiple users or multiple tasks
Share small physical memory among active tasks
Simplifies implementation of some OS features
Vector supercomputers have translation and protection but not demand-paging
(Crays: base&bound, Japanese: pages)
Dont waste expensive CPU time thrashing to disk (make jobs fit in memory)
Mostly run in batch mode (run set of jobs that fits in memory)
More difficult to implement restartable vector instructions
Most embedded processors and DSPs provide physical addressing only
Cant afford area/speed/power budget for virtual memory support
Often there is no secondary storage to swap to!
Programs custom written for particular memory configuration in product
Difficult to implement restartable instructions for exposed architectures
Asanovic/Devadas
Spring 2002
6.823
Complex Pipelining
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Asanovic/Devadas
Spring 2002
6.823
Complex Pipelining: Motivation
Pipelining becomes complex when we want high
performance in the presence of
Memory systems with variable access time
Long latency or partially pipelined floating-point
units
Multiple function and memory units
Realistic Memory Systems
Asanovic/Devadas
Spring 2002
6.823
Latency of access to the main memory is usually
much greater than one cycle and often unpredictable
Solving this problem is a central issue in computer
architecture
Common solutions
  separate instruction and data memory ports and buffers
    (assumes no self-modifying code)
  caches
    single cycle, except in case of a miss => stall
  interleaved memory
    multiple memory accesses => stride conflicts
  split-phase memory operations
    => out-of-order responses
Floating Point Unit
Asanovic/Devadas
Spring 2002
6.823
Much more hardware than an integer unit
Single-cycle floating point units are a bad idea - why?
it is common to have several floating point units
it is common to have different types of FPUs
  Fadd, Fmul, Fdiv, ...
an FPU may be pipelined, partially pipelined or not
pipelined
To operate several FPUs concurrently the register
file needs to have more read and write ports
Function Unit Characteristics
fully
pipelined
partially
pipelined
Asanovic/Devadas
Spring 2002
6.823
[Figure: a fully pipelined unit is busy 1 cycle per op and
 accepts a new op every cycle; a partially pipelined unit is
 busy 2 cycles per op and accepts one every 2 cycles.]
Function units have internal pipeline registers
operands are latched when an instruction
enters a function unit
inputs to a function unit (e.g., register file)
can change during a long latency operation
Asanovic/Devadas
Spring 2002
6.823
Multiple Function Units
[Figure: IF -> ID -> Issue (reading GPRs and FPRs) feeds
 parallel units ALU, Mem, Fadd, Fmul, Fdiv -> WB.]
Floating Point ISA
Asanovic/Devadas
Spring 2002
6.823
Interaction between the Floating point datapath
and the Integer datapath is determined largely
by the ISA
DLX ISA
separate register files for FP and Integer
instructions
separate load/store for FPRs and GPRs
but both use GPRs for address calculation
separate conditions for branches
FP branches are defined in terms of
condition codes
the only interaction is via a set of move
instructions (some ISAs don't even permit this)
New Possibilities of Hazards
Asanovic/Devadas
Spring 2002
6.823
Structural conflicts at the write-back stage due to
variable latencies of different function units
Structural conflicts at the execution stage if some
FPU or memory unit is not pipelined and takes more
than one cycle
Out-of-order write hazards due to variable latencies
of different function units
=> appropriate pipeline interlocks can resolve all these
hazards
Asanovic/Devadas
Spring 2002
6.823
Write-Back Stage Arbitration
Is the latency of each function unit fixed and known?
Yes - structural hazards can be avoided by
delaying the instruction from entering the
execution phase
No - WB stage has to be arbitrated dynamically
WB stage may exert back pressure on
a function unit
Issue may not dispatch an instruction
into that unit until the unit is free
We will assume that the latencies are not known
(This is the more general case)
Asanovic/Devadas
Spring 2002
6.823
WB and Back Pressure
[Figure: Issue feeds single-cycle GPR units and 2-cycle FPR
 units through an interconnect into WB; when several units
 complete at once there is a conflict, and WB exerts back
 pressure on the function units.]
CDC 6600 Seymour Cray, 1963
Asanovic/Devadas
Spring 2002
6.823
A fast pipelined machine with 60-bit words
(128 Kword main memory capacity, 32 banks)
Ten functional units (parallel, unpipelined)
Floating Point adder, 2 multipliers, divider
Integer adder, 2 incrementers, ...
Hardwired control (no microcoding)
Dynamic scheduling of instructions using a
scoreboard
Ten Peripheral Processors for Input/Output
- a fast multi-threaded 12-bit integer ALU
Very fast clock, 10 MHz (FP add in 4 clocks)
>400,000 transistors, 750 sq. ft., 5 tons, 150 kW,
novel freon-based technology for cooling
Fastest machine in world for 5 years (until 7600)
over 100 sold ($7-10M each)
IBM Memo on CDC6600
Asanovic/Devadas
Spring 2002
6.823
Thomas Watson Jr., IBM CEO, August 1963:
Last week, Control Data ... announced the
6600 system. I understand that in the
laboratory developing the system there are
only 34 people including the janitor. Of these,
14 are engineers and 4 are programmers...
Contrasting this modest effort with our vast
development activities, I fail to understand why
we have lost our industry leadership position
by letting someone else offer the world's most
powerful computer.
To which Cray replied: "It seems like Mr. Watson has
answered his own question."
Asanovic/Devadas
Spring 2002
6.823
CDC 6600: Datapath
[Figure: CDC 6600 datapath: Operand Regs (8 x 60-bit) supply
 operands to the 10 Functional Units and receive results;
 Address Regs (8 x 18-bit) drive load/store addresses to
 Central Memory; Index Regs (8 x 18-bit); the IR is fed from
 an 8 x 60-bit Inst. Stack.]
CDC 6600:
A Load/Store Architecture
Asanovic/Devadas
Spring 2002
6.823
Separate instructions to manipulate three types of reg.
8 60-bit data registers (X)
8 18-bit address registers (A)
8 18-bit index registers (B)
All arithmetic and logic instructions are reg-to-reg:
  [6-bit opcode]               Ri <- (Rj) op (Rk)
Only Load and Store instructions refer to memory!
  [6-bit opcode][18-bit disp]  Ri <- M[(Rj) + disp]
Touching address registers 1 to 5 initiates a load
6 to 7 initiates a store
- very useful for vector operations
CDC6600: Vector Addition
      B0 <- -n
loop: JZE B0, exit
      A0 <- B0 + a0     (touching A0 loads X0)
      A1 <- B0 + b0     (loads X1)
      X6 <- X0 + X1
      A6 <- B0 + c0     (stores X6)
      B0 <- B0 + 1
      jump loop
exit:
Asanovic/Devadas
Spring 2002
6.823
Asanovic/Devadas
Spring 2002
6.823
Dependence Analysis
Types of Data Hazards
Asanovic/Devadas
Spring 2002
6.823
Consider executing a sequence of
rk <- (ri) op (rj)
type of instructions

Data-dependence
  r3 <- (r1) op (r2)      Read-after-Write
  r5 <- (r3) op (r4)      (RAW) hazard

Anti-dependence
  r3 <- (r1) op (r2)      Write-after-Read
  r1 <- (r4) op (r5)      (WAR) hazard

Output-dependence
  r3 <- (r1) op (r2)      Write-after-Write
  r3 <- (r6) op (r7)      (WAW) hazard
Detecting Data Hazards
Asanovic/Devadas
Spring 2002
6.823
Range and Domain of instruction i
R(i) = Registers (or other storage) modified
by instruction i
D(i) = Registers (or other storage) read
by instruction i
Suppose instruction j follows instruction i in the
program order. Executing instruction j before the
effect of instruction i has taken place can cause a
RAW hazard if  R(i) ∩ D(j) ≠ ∅
WAR hazard if  D(i) ∩ R(j) ≠ ∅
WAW hazard if  R(i) ∩ R(j) ≠ ∅
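The same tests as a small Python sketch (register names as sets;
instruction i precedes j in program order):

    def hazards(R_i, D_i, R_j, D_j):
        """R = registers written, D = registers read."""
        return {"RAW": bool(R_i & D_j),   # j reads what i writes
                "WAR": bool(D_i & R_j),   # j writes what i reads
                "WAW": bool(R_i & R_j)}   # j writes what i writes

    # i: r3 <- r1 op r2 ; j: r5 <- r3 op r4  => RAW on r3
    print(hazards({"r3"}, {"r1", "r2"}, {"r5"}, {"r3", "r4"}))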
Register vs. Memory
Data Dependence
Asanovic/Devadas
Spring 2002
6.823
Data hazards due to register operands can be
determined at the decode stage
but
data hazards due to memory operands can be
determined only after computing the effective address
store  M[(r1) + disp1] <- (r2)
load   r3 <- M[(r4) + disp2]
Does (r1 + disp1) = (r4 + disp2) ?
Data Hazards: An Example
worksheet
I1  DIVD   f6,  f6,  f4
I2  LD     f2,  45(r3)
I3  MULTD  f0,  f2,  f4
I4  DIVD   f8,  f6,  f2
I5  SUBD   f10, f0,  f6
I6  ADDD   f6,  f8,  f2

RAW Hazards
WAR Hazards
WAW Hazards
Asanovic/Devadas
Spring 2002
6.823
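(One worked answer for the sheet above: RAW hazards: I1->I4 and
I1->I5 on f6; I2->I3, I2->I4, and I2->I6 on f2; I3->I5 on f0;
I4->I6 on f8. WAR hazards: I1->I6, I4->I6, and I5->I6 on f6.
WAW hazard: I1->I6 on f6.)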
Asanovic/Devadas
Spring 2002
6.823
Instruction Scheduling
I1  DIVD   f6,  f6,  f4
I2  LD     f2,  45(r3)
I3  MULTD  f0,  f2,  f4
I4  DIVD   f8,  f6,  f2
I5  SUBD   f10, f0,  f6
I6  ADDD   f6,  f8,  f2
[Figure: dependence graph over I1-I6]
Valid orderings:
  in-order      I1 I2 I3 I4 I5 I6
  out-of-order  I2 I1 I3 I4 I5 I6
  out-of-order  I1 I2 I3 I5 I4 I6
Asanovic/Devadas
Spring 2002
6.823
Out-of-order Completion
In-order Issue
                           Latency
I1  DIVD   f6,  f6,  f4      4
I2  LD     f2,  45(r3)
I3  MULTD  f0,  f2,  f4
I4  DIVD   f8,  f6,  f2
I5  SUBD   f10, f0,  f6
I6  ADDD   f6,  f8,  f2

in-order comp:      1 2 1 2 3 4 3 5 4 6 5 6
out-of-order comp:  1 2 2 3 1 4 3 5 5 4 6 6
Asanovic/Devadas
Spring 2002
6.823
Scoreboard:
A Data Structure to Detect Hazards
When is it Safe to Issue an
Instruction?
Asanovic/Devadas
Spring 2002
6.823
For each instruction at the Issue stage the following
checks need to be made
Is the required function unit available?
Is the input data available? RAW?
Is it safe to write the destination? WAR?
WAW?
Is there a structural conflict at the WB stage?
Assume there is only one instruction in the Issue stage
and if it cannot be issued then the Instruction Fetch and
Decode stall
Asanovic/Devadas
Spring 2002
6.823
A Data Structure for Correct Issues
Keeps track of the status of Functional Units
Name    Busy  Op  Dest  Src1  Src2
Int
Mem
Add1
Add2
Add3
Mult1
Mult2
Div
The instruction i at the Issue stage consults this table
FU available?  check the Busy column
RAW?           search the Dest column for i's sources
WAR?           search the source columns for i's destination
WAW?           search the Dest column for i's destination
An entry is added to the table if no hazard is detected;
An entry is removed from the table after Write-Back
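A hedged sketch of the issue-time consultation of this table
(rows keyed by unit name; the field names are assumptions):

    def can_issue(table, unit, dest, src1, src2):
        if table[unit]["busy"]:                   # FU available?
            return False
        for e in table.values():
            if not e["busy"]:
                continue
            if e["dest"] in (src1, src2):         # RAW
                return False
            if dest in (e["src1"], e["src2"]):    # WAR
                return False
            if dest == e["dest"]:                 # WAW
                return False
        return True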
Simplifying the Data Structure
Assuming In-order Issue
Asanovic/Devadas
Spring 2002
6.823
Suppose the instruction is not dispatched by the Issue
stage if a RAW hazard exists or the required FU is busy
Can the dispatched instruction cause a
WAR hazard ?
WAW hazard ?
Simplifying the Data Structure
Assuming In-order Issue
Asanovic/Devadas
Spring 2002
6.823
Suppose the instruction is not dispatched by the Issue
stage if a RAW hazard exists or the required FU is busy
Can the dispatched instruction cause a
WAR hazard ?
NO: Operands read at issue
WAW hazard ?
YES: Out-of-order completion
Asanovic/Devadas
Spring 2002
6.823
Simplifying the Data Structure ...
No WAR hazard => no need to keep src1 and src2
The Issue stage does not dispatch an instruction in case of
a WAW hazard => a register name can occur at most once in
the dest column
WP[reg#] : a bit-vector to record the registers for which
writes are pending
These bits are set to true by the Issue stage and set to
false by the WB stage
Each pipeline stage in the FUs must carry the dest field and
a flag to indicate if it is valid: the (we, ws) pair
Scoreboard for In-order Issues
Asanovic/Devadas
Spring 2002
6.823
Busy[unit#] : a bit-vector to indicate each unit's
availability (unit = Int, Add, Mult, Div); these bits are
hardwired to the FUs.
WP[reg#] : a bit-vector to record the registers for which
writes are pending
Issue checks the instruction (opcode dest src1 src2)
against the scoreboard (Busy & WP) to dispatch
FU available?  not Busy[FU#]
RAW?           WP[src1] or WP[src2]
WAR?           cannot arise
WAW?           WP[dest]
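The simplified check in one expression (a sketch of the Busy/WP
bit-vectors above):

    def dispatch_ok(busy, wp, fu, dest, src1, src2):
        return (not busy[fu]          # FU available?
                and not wp[src1]      # RAW?
                and not wp[src2]
                and not wp[dest])     # WAW? (WAR cannot arise)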
Scoreboard Dynamics
[Table: cycle-by-cycle (t0-t11) functional-unit status for
 Int(1), Add(1), Mult(3), Div(4) and the registers reserved
 for writes as the sequence below issues in order; write-
 backs occur in the order I2, I1, I3, I5, I4, I6.]

I1  DIVD   f6,  f6,  f4
I2  LD     f2,  45(r3)
I3  MULTD  f0,  f2,  f4
I4  DIVD   f8,  f6,  f2
I5  SUBD   f10, f0,  f6
I6  ADDD   f6,  f8,  f2
Asanovic/Devadas
Spring 2002
6.823
Out-of-Order Execution &
Register Renaming
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Scoreboard for In-order Issue
Asanovic/Devadas
Spring 2002
6.823
Busy[unit#] : a bit-vector to indicate each unit's
availability (unit = Int, Add, Mult, Div); these bits are
hardwired to the FUs.
WP[reg#] : a bit-vector to record the registers for which
writes are pending
Issue checks the instruction (opcode dest src1 src2)
against the scoreboard (Busy & WP) to dispatch
FU available?  not Busy[FU#]
RAW?           WP[src1] or WP[src2]
WAR?           cannot arise
WAW?           WP[dest]
Out-of-Order Dispatch
[Figure: IF -> ID -> Issue-stage buffer dispatching to ALU,
 Mem, Fadd, Fmul -> WB.]
Issue stage buffer holds multiple instructions waiting
to issue.
Decode adds next instruction to buffer if there is
space and the instruction does not cause a WAR
or WAW hazard.
Any instruction in buffer whose RAW hazards are
satisfied can be dispatched (for now, at most one
dispatch per cycle). On a write back (WB), new
instructions may get enabled.
Asanovic/Devadas
Spring 2002
6.823
Out-of-Order Issue: an example
1  LD     F2,  34(R2)    latency 1
2  LD     F4,  45(R3)    latency long
3  MULTD  F6,  F4,  F2
4  SUBD   F8,  F2,  F2
5  DIVD   F4,  F2,  F8
6  ADDD   F10, F6,  F4
[Figure: dependence graph over instructions 1-6]
In-order:
1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6
Out-of-order issue did not allow any significant improvement!
Asanovic/Devadas
Spring 2002
6.823
How many Instructions can be
in the pipeline
Which features of an ISA limit the number of
instructions in the pipeline?
Which features of a program limit the number of
instructions in the pipeline?
Overcoming the Lack of
Register Names
Asanovic/Devadas
Spring 2002
6.823
Number of registers in an ISA limits the number
of partially executed instructions in complex pipelines
Floating Point pipelines often cannot be kept filled with a
small number of registers.
IBM 360 had only 4 Floating Point Registers
Can a microarchitecture use more registers than
specified by the ISA without loss of ISA compatibility ?
Robert Tomasulo of IBM suggested an
ingenious solution in 1967 based on on-the-fly
register renaming
Instruction-Level Parallelism
with Renaming
1  LD     F2,  34(R2)    latency 1
2  LD     F4,  45(R3)    latency long
3  MULTD  F6,  F4,  F2
4  SUBD   F8,  F2,  F2
5  DIVD   F4,  F2,  F8
6  ADDD   F10, F6,  F4
(the X in the original figure marks the anti-dependence on
F4, eliminated by renaming)

In-order:     1 (2,1) . . . . . . 2 3 4 4 (5,3) . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6
Any anti-dependence can be eliminated by renaming
(renaming requires additional storage)
Can it be done in hardware? Yes!
Register Renaming
[Figure: IF -> ID -> ROB dispatching to ALU, Mem, Fadd,
 Fmul -> WB.]
Decode does register renaming and adds instructions
to the issue stage reorder buffer (ROB).
renaming makes WAR or WAW hazards
impossible
Any instruction in ROB whose RAW hazards have
been satisfied can be dispatched.
=> Out-of-order or dataflow execution
Renaming & Out-of-order Issue
Asanovic/Devadas
Spring 2002
6.823
An example
[Figure: a renaming table for F1-F8 (presence bit p + data
 or tag) beside a reorder buffer with entries
 <Ins#, use, exec, op, p1, src1, p2, src2> tagged t1, t2, ...,
 each source holding data or a tag ti.]

1  LD     F2,  34(R2)
2  LD     F4,  45(R3)
3  MULTD  F6,  F4,  F2
4  SUBD   F8,  F2,  F2
5  DIVD   F4,  F2,  F8
6  ADDD   F10, F6,  F4

When are names in sources replaced by data?
When can a name be reused?
Asanovic/Devadas
Spring 2002
6.823
Data-Driven Execution
[Figure: the renaming table & reg file feed a reorder buffer
 (entries <Ins#, use, exec, op, p1, src1, p2, src2>, tags
 t1..tn) that dispatches to the FUs and Load/Store Units;
 results broadcast as <t, result>. Replacing a tag by its
 value is an expensive operation.]
The instruction template (i.e., tag t) is allocated by the
Decode stage, which also stores the tag in the reg file.
When an instruction completes, its tag is deallocated.
Asanovic/Devadas
Spring 2002
6.823
Simplifying Allocation/Deallocation
[Figure: reorder buffer entries t1..tn managed circularly,
 with ptr2 = next to deallocate and ptr1 = next available.]
Instruction buffer is managed circularly
When an instruction completes its use bit is
marked free
ptr2 is incremented only if the use bit is marked free
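A hedged sketch of this circular management (entry contents
omitted; only the use bits and the two pointers are modeled):

    class ROB:
        def __init__(self, n):
            self.use, self.n = [False] * n, n
            self.ptr1 = 0                  # next available
            self.ptr2 = 0                  # next to deallocate
        def alloc(self):
            t = self.ptr1
            assert not self.use[t], "buffer full"
            self.use[t] = True
            self.ptr1 = (t + 1) % self.n
            return t
        def complete(self, t):
            self.use[t] = False            # mark free on completion
        def deallocate(self):
            while not self.use[self.ptr2] and self.ptr2 != self.ptr1:
                self.ptr2 = (self.ptr2 + 1) % self.n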
IBM 360/91 Floating Point Unit
Asanovic/Devadas
Spring 2002
6.823
R. M. Tomasulo, 1967
[Figure: instruction templates are distributed by the
 functional units: load buffers (from memory), the Floating
 Point Reg file, tagged <p, data> slots in front of the
 Adder and the Multiplier, and store buffers (to memory);
 results broadcast as <t, result> on a common bus.]
Common bus ensures that data is
made available immediately to all
the instructions waiting for it
Asanovic/Devadas
Spring 2002
6.823
Effectiveness?
Renaming and out-of-order execution were first implemented
in 1969 in the IBM 360/91, but did not show up in subsequent
models until the mid-nineties.
Why ?
Reasons
1. Exceptions not precise!
2. Effective on a very small class of programs
One more problem needed to be solved
Precise Interrupts
Asanovic/Devadas
Spring 2002
6.823
It must appear as if an interrupt is taken between two
instructions (say Ii and Ii+1)
the effect of all instructions up to and including Ii
is totally complete
no effect of any instruction after Ii has taken place
The interrupt handler either aborts the program or restarts
it at Ii+1.
Asanovic/Devadas
Spring 2002
6.823
Effect on Interrupts
Out-of-order Completion
I1  DIVD   f6,  f6,  f4
I2  LD     f2,  45(r3)
I3  MULTD  f0,  f2,  f4
I4  DIVD   f8,  f6,  f2
I5  SUBD   f10, f0,  f6
I6  ADDD   f6,  f8,  f2

out-of-order comp: 1 2 2 3 1 4 3 5 5 4 6 6
(to take an interrupt precisely mid-sequence, f2 and f10
would have to be restored)
Consider interrupts
Precise interrupts are difficult to implement at high speed
- want to start execution of later instructions before
exception checks finished on earlier instructions
Exception Handling (In-Order Five-Stage Pipeline)
[Figure: the same commit-point diagram as before: per-stage
 exception flags (Exc D/E/M) and PCs (PC D/E/M), kill
 signals for each stage and for writeback, Select Handler
 PC, Cause and EPC registers, and asynchronous interrupts
 injected at the commit point.]
Hold exception flags in pipeline until commit point (M stage)
Exceptions in earlier pipe stages override later exceptions
Inject external interrupts at commit point (override others)
If exception at commit: update Cause and EPC registers, kill
all stages, inject handler PC into fetch stage
Phases of Instruction Execution
Asanovic/Devadas
Spring 2002
6.823
[Figure: PC -> I-cache -> Fetch Buffer -> Issue Buffer ->
 Func. Units -> Result Buffer -> Arch. State]
Fetch: Instruction bits retrieved from
cache.
Decode: Instructions placed in appropriate
issue stage buffer (sometimes called
issue or dispatch)
Execute: Instructions and operands sent to
execution units (sometimes called issue or
dispatch). When execution completes, all
results and exception flags are available.
Commit: Instruction irrevocably updates
architectural state (sometimes called
graduation or completion).
In-Order Commit for
Precise Exceptions
Asanovic/Devadas
Spring 2002
6.823
Instructions fetched and decoded into instruction reorder
buffer in-order
Execution is out-of-order (=> out-of-order completion)
Commit (write-back to architectural state, regfile + memory)
is in-order
Temporary storage needed to hold results before commit
(shadow registers and store buffers)
[Figure: Fetch -> Decode (in-order) -> Reorder Buffer ->
 Execute (out-of-order) -> Commit (in-order); on an
 exception, kill signals flush the pipeline and the handler
 PC is injected into Fetch.]
Asanovic/Devadas
Spring 2002
6.823
Extensions for Precise Exceptions
[Figure: instruction reorder buffer entries <Inst#, use,
 exec, op, p1, src1, p2, src2, pd, dest, data, cause>, with
 ptr2 = next to commit and ptr1 = next available.]
add <pd, dest, data, cause> fields to the instruction template
commit instructions to reg file and memory in program order
  => buffers can be maintained circularly
on exception, clear the reorder buffer by resetting ptr1 = ptr2
(stores must wait for commit before updating memory)
Rollback and Renaming
Asanovic/Devadas
Spring 2002
6.823
[Figure: the register file now holds only committed state;
 the reorder buffer (entries <Ins#, use, exec, op, p1, src1,
 p2, src2, pd, dest, data>, tags t1..tn) feeds the FUs and
 Load/Store Units, and Commit writes results <t, result>
 back to the register file.]
Register file does not contain renaming tags any more.
How does the decode stage find the tag of a source
register?
Asanovic/Devadas
Spring 2002
6.823
Renaming Table
[Figure: a rename table (entries r1, r2, ... holding a tag
 ti and valid bit vi) sits in front of the register file and
 reorder buffer; Commit broadcasts <t, result>.]
Renaming table is like a cache to speed up register name
look up (rename tag + valid bit per entry). It needs to be
cleared after each exception taken . When else are valid
bits cleared? ______________________________________
Asanovic/Devadas
Spring 2002
6.823
Effect of Control Transfer on
Pipelined Execution
Control transfer instructions require insertion
of bubbles in the pipeline.
The number of bubbles depends upon the number
of cycles it takes
to determine the next instruction address, and
to fetch the next instruction
Branch Penalty
[Figure: PC -> I-cache -> Fetch Buffer -> Issue Buffer ->
 Func. Units -> Result Buffer -> Arch. State; the next fetch
 starts at the Fetch stage while the branch executes much
 later, in the Execute stage.]
Asanovic/Devadas
Spring 2002
6.823
Asanovic/Devadas
Spring 2002
6.823
Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750MHz, 2000)
Fetch:    A  PC Generation/Mux
          P  Instruction Fetch Stage 1
          F  Instruction Fetch Stage 2
Decode:   B  Branch Address Calc/Begin Decode
          I  Complete Decode
          J  Steer Instructions to Functional Units
Execute:  R  Register File Read
          E  Integer Execute
          Remainder of execute pipeline (+ another 6 stages)
Branch penalty: Cycles? ________ Instructions? _________
Average Run-Length between
Branches
Asanovic/Devadas
Spring 2002
6.823
Average dynamic instruction mix from SPEC92:
            SPECint92   SPECfp92
ALU            39 %       13 %
FPU Add         -         20 %
FPU Mult        -         13 %
load           26 %       23 %
store           9 %        9 %
branch         16 %        8 %
other          10 %       12 %
SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92:  doduc, ear, hydro2d, mdijdp2, su2cor
What is the average run length between branches?
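(Worked: the average run length is the reciprocal of the branch
frequency, roughly 1/0.16 ~ 6 instructions for SPECint92 and
1/0.08 ~ 12 instructions for SPECfp92.)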
Asanovic/Devadas
Spring 2002
6.823
Reducing Control Transfer Penalties
Software solution
loop unrolling
Increases the run length
instruction scheduling
Compute the branch condition as early
as possible
(limited)
Hardware solution
delay slots
replaces pipeline bubbles with useful work
(requires software cooperation)
branch prediction & speculative execution
of instructions beyond the branch
Branch Prediction
Asanovic/Devadas
Spring 2002
6.823
Motivation: branch penalties limit performance of deeply
pipelined processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures: branch history tables, branch
target buffers, etc.
Mispredict recovery mechanisms:
In-order machines: kill instructions following
branch in pipeline
Out-of-order machines: shadow registers and
memory buffers for each speculated branch
DLX Branches and Jumps
Instruction   Taken known?       Target known?
BEQZ/BNEZ     After Reg. Fetch   After Inst. Fetch
J             Always Taken       After Inst. Fetch
JR            Always Taken       After Reg. Fetch
Must know (or guess) both target address and
whether taken to execute branch/jump.
Asanovic/Devadas
Spring 2002
6.823
Branch Prediction &
Speculative Execution
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
(Recap of the previous lecture's closing slides: average
run-length between branches, reducing control transfer
penalties, effect of control transfer on pipelined
execution, DLX branches and jumps, and branch penalty.)
Asanovic/Devadas
Spring 2002
6.823
Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750 MHz, 2000)
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode   <- branch target
                                           address known
  I  Complete Decode
  J  Steer Instructions to Functional Units
  R  Register File Read
  E  Integer Execute                    <- branch direction &
                                           jump register
                                           target known
  Remainder of execute pipeline (+ another 6 stages)
Branch Prediction
Asanovic/Devadas
Spring 2002
6.823
Motivation: branch penalties limit performance of deeply
pipelined processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures: branch history tables, branch
target buffers, etc.
Mispredict recovery mechanisms:
In-order machines: kill instructions following
branch in pipeline
Out-of-order machines: shadow registers and
memory buffers for each speculated branch
Static Branch Prediction
Asanovic/Devadas
Spring 2002
6.823
(Encode prediction as part of branch instruction)
[Figure: the probability that a backward branch (a JZ back
 to the top of a loop) is taken is ~90%; a forward JZ is
 taken ~50% of the time; overall ~60-70% of branches are
 taken.]
Can predict all taken,
or backwards taken/forward not-taken (use offset sign bit)
ISA can attach additional semantics to branches about
preferred direction, e.g., Motorola MC88110
bne0 (preferred taken)
beq0 (not taken)
ISA can allow arbitrary choice of statically predicted
direction (HP PA-RISC, Intel IA-64)
Asanovic/Devadas
Spring 2002
6.823
Dynamic Branch Prediction
learning based on past behavior
Temporal correlation
The way a branch resolves may be a good predictor
of the way it will resolve at the next execution
Spatial correlation
Several branches may resolve in a highly correlated
manner (a preferred path of execution)
Branch Prediction Bits
Asanovic/Devadas
Spring 2002
6.823
Assume 2 BP bits per instruction
Change the prediction after two consecutive mistakes!
[Figure: four-state machine over BP state =
 (predict take / don't take) x (last prediction right /
 wrong); taken / not-taken arcs move between the states, and
 the predicted direction flips only after two consecutive
 wrong predictions.]
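A sketch in the same spirit, using the common 2-bit saturating
counter (a close cousin of the state machine above; from a
saturated state it likewise needs two consecutive mistakes to
flip the prediction):

    def predict(counter):          # counter in 0..3
        return counter >= 2        # predict taken in states 2, 3

    def update(counter, taken):
        if taken:
            return min(counter + 1, 3)
        return max(counter - 1, 0)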
Branch History Table
[Figure: the low k bits of the Fetch PC (above the 00 word
 offset) form the BHT index into a 2^k-entry BHT with 2
 bits/entry; the I-Cache supplies the instruction, whose
 opcode and offset determine Branch? and Target PC, while
 the BHT bits predict taken / not taken.]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
Exploiting Spatial Correlation
Asanovic/Devadas
Spring 2002
6.823
Yeh and Patt, 1992
if (x[i] < 7) then
y += 1;
if (x[i] < 5) then
c -= 4;
If first condition false, second condition also false
History bit: H records the direction of the last branch
executed by the processor
Two sets of BHT bits (BHT0 & BHT1) per branch
instruction
H = 0 (not taken) => consult BHT0
H = 1 (taken)     => consult BHT1
Two-Level Branch Predictor
Asanovic/Devadas
Spring 2002
6.823
Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~95% correct)
[Figure: the Fetch PC and a 2-bit global branch history
 shift register (shifting in the taken / not-taken result of
 each branch) together index the BHT, which supplies the
 taken / not-taken prediction.]
Limitations of BHTs
Asanovic/Devadas
Spring 2002
6.823
Cannot redirect fetch stream until after branch instruction
is fetched and decoded, and target address determined
[Figure: the UltraSPARC-III fetch pipeline (A P F B I J R E
 as before): even a correctly predicted taken branch pays a
 penalty down to the B stage, and a jump register pays down
 to the E stage.]
Branch Target Buffer (BTB)
[Figure: alongside the I-Cache, a 2^k-entry direct-mapped
 BTB (can also be associative): each entry holds an entry
 PC, a valid bit, and a predicted target PC; the fetch PC is
 compared with the entry PC to produce a match and a valid
 target.]
Keep both the branch PC and target PC in the BTB
PC+4 is fetched if match fails
Only taken branches and jumps held in BTB
Next PC determined before branch fetched and decoded
Combining BTB and BHT
Asanovic/Devadas
Spring 2002
6.823
BTB entries are considerably more expensive than
BHT, but can redirect fetches at earlier stage in
pipeline and can accelerate indirect branches (JR)
BHT can hold many more entries and is more accurate
[Figure: the BTB is consulted at the A stage (PC
 Generation/Mux); the BHT, in a later pipeline stage (around
 Complete Decode), corrects fetch when the BTB misses a
 predicted-taken branch.]
BTB/BHT only updated after branch resolves in E stage
Uses of Jump Register (JR)
Asanovic/Devadas
Spring 2002
6.823
Switch statements (jump to address of matching case)
Dynamic function call (jump to run-time function
address)
Subroutine returns (jump to return address)
How well does BTB work for each of these cases?
BTB Performance
Asanovic/Devadas
Spring 2002
6.823
Switch statements (jump to address of matching case)
BTB works well if same case used repeatedly
Dynamic function call (jump to run-time function
address)
BTB works well if same function usually called,
(e.g., in C++ programming, when objects have
same type in virtual function call)
Subroutine returns (jump to return address)
BTB works well if usually return to the same place
Often one function called from many different
call sites!
Subroutine Return Stack
Asanovic/Devadas
Spring 2002
6.823
Small structure to accelerate JR for subroutine returns,
typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
[Figure: a k-entry stack (typically k = 8-16) holding &fb(),
 &fc(), &fd(): push the call address when a function call
 executes, pop the return address when a subroutine return
 is decoded.]
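A hedged sketch of the return stack (the overflow policy of
dropping the oldest entry is one common choice, assumed here):

    class ReturnStack:
        def __init__(self, k=8):
            self.k, self.stack = k, []
        def push(self, ret_addr):              # on function call
            if len(self.stack) == self.k:
                self.stack.pop(0)              # drop oldest on overflow
            self.stack.append(ret_addr)
        def predict_return(self):              # on decoding a return
            return self.stack.pop() if self.stack else None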
Speculating Both Directions
Asanovic/Devadas
Spring 2002
6.823
An alternative to branch prediction is to execute both
directions of a branch speculatively
resource requirement is proportional to the number of
concurrent speculative executions
only half the resources engage in useful work when both
directions of a branch are executed speculatively
=> branch prediction takes fewer resources than speculative
execution of both paths
With accurate branch prediction, it is more cost effective
to dedicate all resources to the predicted direction
Mispredict Recovery
Asanovic/Devadas
Spring 2002
6.823
In-order execution machines:
Assume no instruction issued after branch can
write-back before branch resolves
Kill all instructions in pipeline behind mispredicted
branch
Out-of-order execution?
Multiple instructions following branch in program
order can complete before branch resolves
In-Order Commit for
Precise Exceptions
Asanovic/Devadas
Spring 2002
6.823
Instructions fetched and decoded into instruction reorder
buffer in-order
Execution is out-of-order (=> out-of-order completion)
Commit (write-back to architectural state, regfile + memory)
is in-order
Temporary storage in the ROB holds results before commit
[Figure: Fetch -> Decode (in-order) -> Reorder Buffer ->
 Execute (out-of-order) -> Commit (in-order); on an
 exception, kill signals flush the pipeline and the handler
 PC is injected into Fetch.]
ROB for Precise Exceptions
[Figure: reorder buffer entries <Inst#, use, exec, op, p1,
 src1, p2, src2, pd, dest, data, cause>, with ptr2 = next to
 commit and ptr1 = next available.]
add <pd, dest, data, cause> fields to the instruction template
commit instructions to reg file and memory in program order
  => buffers can be maintained circularly
on exception, clear the reorder buffer by resetting ptr1 = ptr2
(stores must wait for commit before updating memory)
Branch Misprediction Recovery
[Figure: reorder buffer with ptr2 = next to commit and
 ptr1 = next available; on a mispredicted BEQZ, the
 next-available pointer rolls back to just after the branch,
 discarding the speculative instructions.]
On mispredict:
  Roll back the "next available" pointer to just after the branch
  Reset use bits
  Flush mis-speculated instructions from pipelines
  Restart fetch on the correct branch path
Asanovic/Devadas
Spring 2002
6.823
Branch Misprediction in Pipeline
[Figure: PC -> Fetch -> Decode -> Reorder Buffer -> Execute
 -> Complete -> Commit; branch prediction redirects fetch,
 and branch resolution kills mis-speculated instructions at
 each stage and restarts fetch on the correct path.]
Can have multiple unresolved branches in ROB
Can resolve branches out-of-order
Killing Speculative Instructions
Asanovic/Devadas
Spring 2002
6.823
Each instruction tagged with single speculative bit,
carried throughout all pipelines
Decode stage stalls if second branch encountered
before first speculative branch resolves
When speculative branch resolves:
Prediction incorrect, kill all instructions tagged as
speculative
Prediction correct, clear speculative bit on all
instructions
To allow speculation past multiple branches, add
multiple bits per instruction indicating on which
outstanding branches it is speculative
speculation bits reclaimed when
corresponding branch resolves and is oldest
speculative branch in machine (manage
allocation of speculation tag bits circularly)
Recovering Renaming Table
[Figure: as before, a rename table (tag ti + valid bit vi
 per register) in front of the register file and reorder
 buffer, now augmented with rename snapshots; Commit
 broadcasts <t, result>.]
Take a snapshot of the register rename table at each
predicted branch; recover the earlier snapshot if the branch
is mispredicted.
Asanovic/Devadas
Spring 2002
6.823
Advanced Superscalar
Architectures
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Physical Register Renaming
Asanovic/Devadas
Spring 2002
6.823
(single physical register file: MIPS R10K, Alpha 21264, Pentium-4)
During decode, instructions allocated new physical destination register
Source operands renamed to physical register with newest value
Execution unit only sees physical register numbers
ld  r1, (r3)          =>  ld  P1, (Px)
add r3, r1, #4        =>  add P2, P1, #4
sub r6, r7, r9        =>  sub P3, Py, Pz
add r3, r3, r6        =>  add P4, P2, P3
ld  r6, (r1)          =>  ld  P5, (P1)
add r6, r6, r3        =>  add P6, P5, P4
st  r6, (r1)          =>  st  P6, (P1)
ld  r6, (r11)         =>  ld  P7, (Pw)
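A minimal sketch of this renaming pass (the instruction and
table representations are assumptions; Px/Py/Pz/Pw stand for
mappings already in the table):

    def rename(insts, free_list, table):
        out = []
        for op, dest, srcs in insts:
            psrcs = [table[s] for s in srcs]   # newest value of each source
            pdest = free_list.pop(0)           # fresh physical destination
            table[dest] = pdest
            out.append((op, pdest, psrcs))
        return out

    # rename([("ld", "r1", ["r3"]), ("add", "r3", ["r1"])],
    #        ["P1", "P2"], {"r3": "Px"})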
Asanovic/Devadas
Spring 2002
6.823
Physical Register File
[Figure: a rename table (r1 -> ti, r2 -> tj, ...) with
 snapshots for mispredict recovery; a single physical reg
 file t1..tn feeds the FUs and Load/Store Units; results
 broadcast as <t, result> (ROB not shown).]
One regfile for both committed and speculative values (no data in ROB)
During decode, instruction result allocated new physical register, source
regs translated to physical regs through rename table
Instruction reads data from regfile at start of execute (not in decode)
Write-back updates reg. busy bits on instructions in ROB (assoc. search)
Snapshots of rename table taken at every branch to recover mispredicts
On exception, renaming undone in reverse order of issue (MIPS R10000)
Speculative and Out-of-Order
Execution
Asanovic/Devadas
Spring 2002
6.823
[Figure: PC -> Fetch -> Decode & Rename -> Reorder Buffer
 (in-order) -> Execute out-of-order in the Branch Unit, ALU,
 and MEM (with Store Buffer and D$) over the Physical Reg.
 File -> Commit (in-order); branch resolution updates the
 predictors and kills mis-speculated instructions.]
Lifetime of Physical Registers
Asanovic/Devadas
Spring 2002
6.823
Physical regfile holds committed and speculative values
Physical registers decoupled from ROB entries
(no data in ROB)
ld  r1, (r3)          =>  ld  P1, (Px)
add r3, r1, #4        =>  add P2, P1, #4
sub r6, r7, r9        =>  sub P3, Py, Pz
add r3, r3, r6        =>  add P4, P2, P3
ld  r6, (r1)          =>  ld  P5, (P1)
add r6, r6, r3        =>  add P6, P5, P4
st  r6, (r1)          =>  st  P6, (P1)
ld  r6, (r11)         =>  ld  P7, (Pw)
When can we reuse a physical register?
When next write of same architectural register commits
Physical Register Management
[Figure sequence: successive snapshots of the rename table
 (R0-R7), the physical registers P0-P8 (with presence bits p
 and committed values <R1>, <R3>, <R6>, <R7>), the free list
 (P0, P1, P3, P2, P4, ..., Pn), and ROB entries
 <use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd> as the code
     ld  r1, 0(r3)
     add r3, r1, #4
     sub r6, r7, r6
     add r3, r3, r6
     ld  r6, 0(r1)
 is decoded one instruction per snapshot and then begins to
 execute and commit. Each decode pops a new PRd off the free
 list, records the previous mapping as LPRd, and updates the
 rename table; each commit returns the committing
 instruction's LPRd (e.g., P8, then P7) to the free list.]
(LPRd requires a third read port on the Rename Table for
each instruction)
Reorder Buffer Holds
Active Instruction Window
(Older instructions)   ld  r1, (r3)
                       add r3, r1, r2
                       sub r6, r7, r9
                       add r3, r3, r6
                       ld  r6, (r1)
                       add r6, r6, r3
                       st  r6, (r1)
(Newer instructions)   ld  r6, (r1)
[Figure: between cycle t and cycle t+1 the commit, execute,
 and fetch windows each advance down this instruction list.]
Asanovic/Devadas
Spring 2002
6.823
Superscalar Register Renaming
During decode, instructions allocated new physical destination register
Source operands renamed to physical register with newest value
Execution unit only sees physical register numbers
[Figure: two instructions <Op, Dest, Src1, Src2> per cycle
 present read addresses to the rename table, draw fresh
 PDests from the register free list, and write updated
 mappings back through the write ports, producing
 <Op, PDest, PSrc1, PSrc2>.]
Does this work?
Asanovic/Devadas
Spring 2002
6.823
Superscalar Register Renaming
[Figure: the same datapath with "=?" comparators between
 Inst 1's Dest and Inst 2's Src1/Src2, so Inst 2 can bypass
 the stale rename-table read.]
Must check for RAW hazards between instructions issuing in
same cycle. Can be done in parallel with rename lookup.
(MIPS R10K renames 4 serially-RAW-dependent insts/cycle)
Asanovic/Devadas
Spring 2002
6.823
Memory Dependencies
st r1, (r2)
ld r3, (r4)
When can we execute the load?
Speculative Loads / Stores
Asanovic/Devadas
Spring 2002
6.823
Just like register updates, stores should not modify
the memory until after the instruction is committed
store buffer entry must carry a speculation bit and
the tag of the corresponding store instruction
If the instruction is committed, the speculation bit of
the corresponding store buffer entry is cleared, and
store is written to cache
If the instruction is killed, the corresponding store
buffer entry is freed
Loads work normally -- older store buffer entries need to be
searched before accessing the memory or the cache
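A hedged sketch of that search order (the newest matching
store wins, then the cache):

    def load(addr, store_buffer, cache):
        for e in reversed(store_buffer):       # newest entry first
            if e["valid"] and e["addr"] == addr:
                return e["data"]               # store-to-load forwarding
        return cache[addr]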
Asanovic/Devadas
Spring 2002
6.823
Load Path
[Figure: the load address is presented both to the
 speculative store buffer (entries: V and S bits, tag, data)
 and to the L1 data cache (tags + data); the store commit
 path writes buffer entries into the cache, and the load
 data is selected from whichever hits.]
Hit in speculative store buffer has priority over hit in data
cache
Hit to newer store has priority over hits to older stores in
speculative store buffer
Datapath: Branch Prediction
and Speculative Execution
Asanovic/Devadas
Spring 2002
6.823
[Figure: the same speculative datapath as before: PC ->
 Fetch -> Decode & Rename -> Reorder Buffer -> Branch Unit /
 ALU / MEM over the Reg. File, with Store Buffer and D$;
 Commit is in-order; branch resolution updates the
 predictors and kills mis-speculated instructions.]
In-Order Memory Queue
Asanovic/Devadas
Spring 2002
6.823
Execute all loads and stores in program order
=> Load and store cannot leave ROB for execution
until all previous loads and stores have
completed execution
Can still execute loads and stores speculatively,
and out-of-order with respect to other instructions
Stores held in store buffer until commit
Conservative Out-of-Order
Load Execution
Asanovic/Devadas
Spring 2002
6.823
st r1, (r2)
ld r3, (r4)
Split execution of store instruction into two
phases: address calculation and data write
Can execute load before store, if addresses
known and r4 != r2
Each load address is compared with the addresses of all
previous uncommitted stores (can use a partial conservative
check, i.e., bottom 12 bits of address)
Don't execute the load if any previous store address is not
known
(MIPS R10K, 16 entry address queue)
Address Speculation
Asanovic/Devadas
Spring 2002
6.823
st r1, (r2)
ld r3, (r4)
Guess that r4 != r2
Execute load before store address known
Need to hold all completed but uncommitted
load/store addresses in program order
If subsequently find r4==r2, squash load and all
following instructions
=> Large penalty for inaccurate address speculation
Asanovic/Devadas
Spring 2002
6.823
Memory Dependence Prediction
(Alpha 21264)
st r1, (r2)
ld r3, (r4)
Guess that r4 != r2 and execute load before store
If later find r4==r2, squash load and all following
instructions, but mark load instruction as store-wait
Subsequent executions of the same load instruction
will wait for all previous stores to complete
Periodically clear store-wait bits
Asanovic/Devadas
Spring 2002
6.823
Improving Instruction Fetch
Performance of speculative out-of-order
machines often limited by instruction fetch
bandwidth
speculative execution can fetch 2-3x more
instructions than are committed
mispredict penalties dominated by time to refill
instruction window
taken branches are particularly troublesome
Increasing Taken Branch Bandwidth
Asanovic/Devadas
Spring 2002
6.823
(Alpha 21264 I-Cache)
[Figure: PC generation, branch prediction, instruction
 decode, and validity checks surround the cached fetch path:
 each I-cache line stores 4 instructions plus line- and
 way-predict fields for the next block; tag comparisons (=?)
 yield hit/miss/way off the fast fetch path.]
Fold 2-way tags and BTB into predicted next block
Take tag checks, inst. decode, branch predict out of loop
Raw RAM speed on critical loop (1 cycle at ~1 GHz)
2-bit hysteresis counter per block prevents overtraining
Tournament Branch Predictor
Asanovic/Devadas
Spring 2002
6.823
(Alpha 21264)
[Figure: the PC indexes a local history table (1,024 x 10b)
 whose history indexes local predictions (1,024 x 3b); a
 12-bit global history indexes the global prediction
 (4,096 x 2b) and the choice prediction (4,096 x 2b), which
 selects the local or the global prediction.]
Choice predictor learns whether best to use local or
global branch history in predicting next branch
Global history is speculatively updated but restored on
mispredict
Claim 90-100% success on range of applications
Taken Branch Limit
Asanovic/Devadas
Spring 2002
6.823
Integer codes have a taken branch every 6-9
instructions
To avoid fetch bottleneck, must execute multiple taken
branches per cycle when increasing performance
This implies:
predicting multiple branches per cycle
fetching multiple non-contiguous blocks per cycle
Branch Address Cache
Asanovic/Devadas
Spring 2002
6.823
(Yeh, Marr, Patt)
[Figure: each entry holds an entry PC, a valid bit,
 predicted target #1, a length field, and predicted target
 #2; the fetch PC is matched to produce a valid target #1,
 len #1, and target #2.]
Extend BTB to return multiple branch predictions
per cycle
Fetching Multiple Basic Blocks
Requires either
multiported cache: expensive
interleaving: bank conflicts will occur
Merging multiple blocks to feed to decoders adds latency,
increasing the mispredict penalty and reducing branch
throughput
Asanovic/Devadas
Spring 2002
6.823
Asanovic/Devadas
Spring 2002
6.823
Trace Cache
Key Idea: Pack multiple non-contiguous basic
blocks into one contiguous trace cache line
[Figure: conventional cache lines are broken by branches (BR); a trace cache line concatenates the basic blocks along the predicted path across several branches]
Single fetch brings in multiple basic blocks
Trace cache indexed by start address and next n branch
predictions
Used in Intel Pentium-4 processor to hold decoded uops
MIPS R10000 (1995)
Asanovic/Devadas
Spring 2002
6.823
0.35 μm CMOS, 4 metal layers
Four instructions per cycle
Out-of-order execution
Register renaming
Speculative execution past 4
branches
On-chip 32KB/32KB split I/D
cache, 2-way set-associative
Off-chip L2 cache
Non-blocking caches
Compare with simple 5-stage
pipeline (R5K series)
~1.6x performance
SPECint95
~5x CPU logic area
~10x design effort
Asanovic/Devadas
Spring 2002
6.823
VLIW/EPIC: Statically Scheduled ILP
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Little's Law
Asanovic/Devadas
Spring 2002
6.823
Parallelism = Throughput * Latency
[Figure: one operation drawn as a box whose width is its latency in cycles and whose height is its throughput per cycle]
Asanovic/Devadas
Spring 2002
6.823
Example Pipelined ILP Machine
Max Throughput, Six Instructions per Cycle
[Figure: pipelines drawn one stage per cycle of latency]
Two Integer Units, Single Cycle Latency
Two Load/Store Units, Three Cycle Latency
Two Floating-Point Units, Four Cycle Latency
How much instruction-level parallelism (ILP) required
to keep machine pipelines busy?
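As a worked check (mine, not the slide's): by Little's Law the parallelism needed is the sum over units of throughput x latency = 2x1 (integer) + 2x3 (load/store) + 2x4 (floating-point) = 16 independent instructions in flight every cycle.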
Asanovic/Devadas
Spring 2002
6.823
Superscalar Control Logic Scaling
Issue Width N
Issue Group
Previously
Issued
Instructions
Lifetime
Number of interlock checks and/or tag compares grows as
N*(N*L) where L is lifetime of instructions in machine
Each of N instructions issued or completed must check against N*L
instructions in flight
For in-order machines, lifetime L is related to pipeline latencies
For out-of-order machines, L also includes time spent in
instruction buffers (instruction window or ROB)
As N increases, need larger instruction window to find enough
parallelism to keep machine busy => greater lifetime L
=> Out-of-order control logic grows faster than N^2 (~N^3)
Asanovic/Devadas
Spring 2002
6.823
Out-of-Order Control Complexity:
MIPS R10000
[Die photo: a large fraction of the R10000 core is control logic]
Sequential ISA Bottleneck
Asanovic/Devadas
Spring 2002
6.823
Sequential source code (e.g., a = foo(b); for (i=0; i< ...) ...)
  -> superscalar compiler: find independent operations, schedule operations
  -> sequential machine code
  -> superscalar processor: check instruction dependencies, schedule execution
Asanovic/Devadas
Spring 2002
6.823
VLIW: Very Long Instruction Word
Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
Two Integer Units,
Single Cycle Latency
Two Load/Store Units,
Three Cycle Latency
Two Floating-Point Units,
Four Cycle Latency
Compiler schedules parallel execution
Multiple parallel operations packed into one long
instruction word
Compiler must avoid data hazards (no interlocks)
Asanovic/Devadas
Spring 2002
6.823
Early VLIW Machines
FPS AP120B (1976)
scientific attached array processor
first commercial wide instruction machine
hand-coded vector math libraries using software pipelining and loop
unrolling
Multiflow Trace (1987)
commercialization of ideas from Fisher's Yale group including trace
scheduling
available in configurations with 7, 14, or 28 operations/instruction
28 operations packed into a 1024-bit instruction word
Cydrome Cydra-5 (1987)
7 operations encoded in 256-bit instruction word
rotating register file
Asanovic/Devadas
Spring 2002
6.823
Loop Execution
for (i=0; i<N; i++)
  B[i] = A[i] + C;

Compile:
loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

Schedule into the machine's slots: Int1 | Int2 | M1 | M2 | FP+ | FPx
Asanovic/Devadas
Spring 2002
6.823
Loop Execution
for (i=0; i<N; i++)
  B[i] = A[i] + C;

Compile:
loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

[Schedule: ld issues on M1 with add r1 on Int1; fadd issues on FP+ after the load latency; sd issues after the fadd latency, with add r2 and bne on the integer units; one iteration completes every 8 cycles]

How many FP ops/cycle?
1 fadd / 8 cycles = 0.125
Loop Unrolling
Asanovic/Devadas
Spring 2002
6.823
for (i=0; i<N; i++)
B[i] = A[i] + C;
Unroll inner loop to perform
4 iterations at once
for (i=0; i<N; i+=4)
{
  B[i]   = A[i]   + C;
  B[i+1] = A[i+1] + C;
  B[i+2] = A[i+2] + C;
  B[i+3] = A[i+3] + C;
}
Need to handle values of N that are not multiples of
unrolling factor with final cleanup loop
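A minimal C sketch of the unroll-plus-cleanup pattern for this loop:

    /* Unrolled by 4; a scalar cleanup loop handles N mod 4 leftovers. */
    void add_const(double *B, const double *A, double C, int N) {
        int i;
        for (i = 0; i + 3 < N; i += 4) {
            B[i]   = A[i]   + C;
            B[i+1] = A[i+1] + C;
            B[i+2] = A[i+2] + C;
            B[i+3] = A[i+3] + C;
        }
        for (; i < N; i++)          /* cleanup iterations */
            B[i] = A[i] + C;
    }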
Scheduling Loop Unrolled Code
Asanovic/Devadas
Spring 2002
6.823
Unroll 4 ways
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
sd f8, 24(r2)
add r2, 32
bne r1, r3, loop
Schedule the unrolled loop into the slots: Int1 | Int2 | M1 | M2 | FP+ | FPx
Scheduling Loop Unrolled Code
Asanovic/Devadas
Spring 2002
6.823
Unroll 4 ways
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
sd f8, 24(r2)
add r2, 32
bne r1, r3, loop
[Schedule: the four loads (ld f1-f4) issue back-to-back on M1, with add r1 alongside on Int1; after the load latency the four fadds (f5-f8) issue on FP+; after the fadd latency the four stores (sd f5-f8) issue on M1, with add r2 and bne filling the integer slots]

How many FLOPS/cycle?
Asanovic/Devadas
Spring 2002
6.823
Software Pipelining
Unroll 4 ways first
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
add r2, 32
sd f8, -8(r2)
bne r1, r3, loop
Schedule into the slots: Int1 | Int2 | M1 | M2 | FP+ | FPx
Asanovic/Devadas
Spring 2002
6.823
Software Pipelining
Unroll 4 ways first
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
add r2, 32
sd f8, -8(r2)
bne r1, r3, loop
[Schedule: the software-pipelined kernel overlaps three iterations; in each window the loads (ld f1-f4, with add r1) for one iteration, the fadds (f5-f8) for the previous iteration, and the stores (sd f5-f8, with add r2 and bne) for the iteration before that all issue together. A prolog ramps the pipeline up (loads only, then loads plus fadds) and an epilog drains it (fadds plus stores, then stores only).]

How many FLOPS/cycle?
Asanovic/Devadas
Spring 2002
6.823
Software Pipelining
vs. Loop Unrolling
[Figure: performance vs. time over one loop. Loop-unrolled code pays startup overhead and wind-down overhead on every pass through the unrolled body; software-pipelined code ramps up once, runs at peak, and winds down once.]
Software pipelining pays startup/wind-down
costs only once per loop, not once per iteration
Asanovic/Devadas
Spring 2002
6.823
What if there are no loops?
[Figure: control-flow graph of small basic blocks]
Branches limit basic block size
in control-flow intensive
irregular code
Difficult to find ILP in individual
basic blocks
Asanovic/Devadas
Spring 2002
6.823
Trace Scheduling [Fisher, Ellis]
Pick string of basic blocks, a trace, that
represents most frequent branch path
Use profiling feedback or compiler
heuristics to find common branch paths
Schedule whole trace at once
Add fixup code to cope with branches
jumping out of trace
Asanovic/Devadas
Spring 2002
6.823
Problems with Classic VLIW
Object-code compatibility
have to recompile all code for every machine, even for two
machines in same generation
Object code size
instruction padding wastes instruction memory/cache
loop unrolling/software pipelining replicates code
Scheduling variable latency memory operations
caches and/or memory bank conflicts impose statically
unpredictable variability
Scheduling around statically unpredictable
branches
optimal schedule varies with branch path
Asanovic/Devadas
Spring 2002
6.823
VLIW Instruction Encoding
Various schemes to reduce effect of unused fields
Compressed format in memory, expand on I-cache refill
Cydra-5 MultiOp instructions: execute VLIW as sequential operations
Mark parallel groups (used in TMS320C6x DSPs, Intel IA-64)
Group 1
Group 2
Group 3
Rotating Register Files
Asanovic/Devadas
Spring 2002
6.823
Problem: Scheduled loops require lots of registers, lots of
duplicated code in prolog, epilog
Prolog:  ld r1, ()
         add r2, r1, #1 ; ld r1, ()
Loop:    st r2, () ; add r2, r1, #1 ; ld r1, ()
Epilog:  st r2, () ; add r2, r1, #1
         st r2, ()
Solution: Allocate new set of registers for each loop iteration
Asanovic/Devadas
Spring 2002
6.823
Rotating Register File
[Figure: physical registers P0-P7; with RRB=3, logical register R1 maps to physical P4]
Rotating Register Base (RRB) register points to base of
current register set. Value added on to logical register
specifier to give physical register number. Usually, split into
rotating and non-rotating registers.
Prolog:  ld r1, () ; dec RRB
         ld r1, () ; add r3, r2, #1 ; dec RRB
Loop:    ld r1, () ; add r3, r2, #1 ; st r4, () ; bloop
Epilog:  add r2, r1, #1 ; st r4, () ; dec RRB
         st r4, () ; dec RRB

(the loop-closing branch bloop decrements RRB)
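A C sketch of the register mapping just described (the size of the rotating region is an assumption): the physical number is the logical specifier plus RRB modulo the rotating-region size, and the loop-closing branch decrements RRB as a side effect.

    #define NROT 8                     /* assumed rotating-region size */
    static int rrb;                    /* Rotating Register Base       */

    /* Logical specifiers below NROT rotate; the rest are static. */
    int phys_reg(int logical) {
        if (logical >= NROT) return logical;        /* non-rotating */
        return (logical + rrb) & (NROT - 1);        /* rotating     */
    }

    /* Side effect of the loop-closing branch (bloop). */
    void bloop_decrement(void) { rrb = (rrb - 1) & (NROT - 1); }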
Asanovic/Devadas
Spring 2002
6.823
Rotating Register File
(Previous Loop Example)
Three cycle load latency
encoded as difference of
3 in register specifier
number (f4 - f1 = 3)
Four cycle fadd latency
encoded as difference of
4 in register specifier
number (f9 - f5 = 4)
loop: ld   f1, ()
      fadd f5, f4, ...
      sd   f9, ()
      bloop

RRB=8: ld P9, () ; fadd P13, P12, ... ; sd P17, () ; bloop
RRB=7: ld P8, () ; fadd P12, P11, ... ; sd P16, () ; bloop
RRB=6: ld P7, () ; fadd P11, P10, ... ; sd P15, () ; bloop
RRB=5: ld P6, () ; fadd P10, P9, ...  ; sd P14, () ; bloop
RRB=4: ld P5, () ; fadd P9, P8, ...   ; sd P13, () ; bloop
RRB=3: ld P4, () ; fadd P8, P7, ...   ; sd P12, () ; bloop
RRB=2: ld P3, () ; fadd P7, P6, ...   ; sd P11, () ; bloop
RRB=1: ld P2, () ; fadd P6, P5, ...   ; sd P10, () ; bloop
Predicate Software Pipeline
Stages
Asanovic/Devadas
Spring 2002
6.823
Single VLIW Instruction:  (p1) ld r1 | (p2) add r3 | (p3) st r4 | (p1) bloop

Dynamic Execution:
  (p1) ld r1                              (p1) bloop
  (p1) ld r1   (p2) add r3                (p1) bloop
  (p1) ld r1   (p2) add r3   (p3) st r4   (p1) bloop
  (p1) ld r1   (p2) add r3   (p3) st r4   (p1) bloop
  (p1) ld r1   (p2) add r3   (p3) st r4   (p1) bloop
               (p2) add r3   (p3) st r4   (p1) bloop
                             (p3) st r4   (p1) bloop
Software pipeline stages turned on by rotating predicate registers
Much denser encoding of loops
Asanovic/Devadas
Spring 2002
6.823
Cydra-5:
Memory Latency Register (MLR)
Problem: Loads have variable latency
Solution: Let software choose desired memory latency
Compiler tries to schedule code for maximum load-use distance
Software sets MLR to latency that matches code schedule
Hardware ensures that loads take exactly MLR cycles to return
values into processor pipeline
Hardware buffers loads that return early
Hardware stalls processor if loads return late
Asanovic/Devadas
Spring 2002
6.823
Intel EPIC IA-64
EPIC is the style of architecture (cf. CISC, RISC)
Explicitly Parallel Instruction Computing
IA-64 is Intel's chosen ISA (cf. x86, MIPS)
IA-64 = Intel Architecture 64-bit
An object-code compatible VLIW
Itanium (aka Merced) is first implementation (cf. 8086)
First customer shipment should be in 1997, 1998, 1999, 2000, 2001
McKinley will be second implementation due 2002
IA-64 Instruction Format
Instruction 2
Instruction 1
Instruction 0
Asanovic/Devadas
Spring 2002
6.823
Template
128-bit instruction bundle
Template bits describe grouping of these
instructions with others in adjacent bundles
Each group contains instructions that can execute
in parallel
bundle j-1 bundle j
group i-1
bundle j+1 bundle j+2
group i
group i+1
group i+2
Asanovic/Devadas
Spring 2002
6.823
IA-64 Registers
128 General Purpose 64-bit Integer Registers
128 General Purpose 64/80-bit Floating Point Registers
64 1-bit Predicate Registers
GPRs rotate to reduce code size for software pipelined
loops
IA-64 Predicated Execution
Asanovic/Devadas
Spring 2002
6.823
Problem: Mispredicted branches limit ILP
Solution: Eliminate some hard to predict branches with predicated
execution
Almost all IA-64 instructions can be executed conditionally under predicate
Instruction becomes NOP if predicate register false
Four basic blocks:
  b0: Inst 1
      Inst 2
      br a==b, b2
  b1: Inst 3        (then)
      Inst 4
      br b3
  b2: Inst 5        (else)
      Inst 6
  b3: Inst 7
      Inst 8

Predication => one basic block:
      Inst 1
      Inst 2
      p1,p2 <- cmp(a==b)
      (p1) Inst 3 || (p2) Inst 5
      (p1) Inst 4 || (p2) Inst 6
      Inst 7
      Inst 8
Mahlke et al, ISCA95: On average
>50% branches removed
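Predication is compiler-visible if-conversion. A hedged C analogy (plain C has no predicate registers, so a branch-free select stands in for the (p) qualifiers; f1/f2 are hypothetical helpers):

    static int f1(int y) { return y + 1; }    /* hypothetical then-arm */
    static int f2(int y) { return y - 1; }    /* hypothetical else-arm */

    int if_converted(int a, int b, int y) {
        int p1 = (a == b);     /* p1,p2 <- cmp(a==b); p2 is !p1             */
        int t1 = f1(y);        /* (p1) arm executes unconditionally         */
        int t2 = f2(y);        /* (p2) arm executes unconditionally         */
        return p1 ? t1 : t2;   /* keep only the result the predicate picks  */
    }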
IA-64 Speculative Execution
Asanovic/Devadas
Spring 2002
6.823
Problem: Branches restrict compiler code motion
Solution: Speculative operations that dont cause exceptions
Original (can't move the load above the branch because it might cause a spurious exception):
      Inst 1
      Inst 2
      br a==b, b2
      Load r1
      Use r1
      Inst 3

With speculation:
      Load.s r1          # speculative load never causes an exception,
      Inst 1             # but sets a poison bit on its destination register
      Inst 2
      br a==b, b2
      Chk.s r1           # check for exception in the original home block;
      Use r1             # jumps to fixup code if an exception was detected
      Inst 3
Particularly useful for scheduling long latency loads early
IA-64 Data Speculation
Asanovic/Devadas
Spring 2002
6.823
Problem: Possible memory hazards limit code scheduling
Solution: Hardware to check pointer hazards
Original (can't move the load above the store because the store might be to the same address):
      Inst 1
      Inst 2
      Store
      Load r1
      Use r1
      Inst 3

With data speculation:
      Load.a r1          # data-speculative load adds its address to the
      Inst 1             # address check table
      Inst 2
      Store              # a store invalidates any matching loads in the table
      Load.c r1          # check: if the load is invalid (or missing),
      Use r1             # jump to fixup code
      Inst 3
Requires associative hardware in address check table
Asanovic/Devadas
Spring 2002
6.823
ILP Datapath Hardware Scaling
Register File
Multiple
Functional
Units
Replicating functional units and
cache/memory banks is straightforward and
scales linearly
Register file ports and bypass logic for N
functional units scale quadratically (N*N)
Memory interconnection among N functional
units and memory banks also scales
quadratically
Memory Interconnect
Multiple
Cache/Memory
Banks
(For large N, could try O(N logN) interconnect
schemes)
Technology scaling: Wires are getting even
slower relative to gate delays
Complex interconnect adds latency as well
as area
=> Need greater parallelism to hide latencies
Clustered VLIW
[Figure: clusters, each with a local regfile and local functional units, joined by a cluster interconnect; a memory interconnect links the clusters to multiple cache/memory banks]
Asanovic/Devadas
Spring 2002
6.823
Divide machine into clusters of
local register files and local
functional units
Lower bandwidth/higher latency
interconnect between clusters
Software responsible for mapping
computations to minimize
communication overhead
Asanovic/Devadas
Spring 2002
6.823
Limits of Static Scheduling
Four major weaknesses of static scheduling:
Unpredictable branches
Variable memory latency (unpredictable cache misses)
Code size explosion
Compiler complexity
Asanovic/Devadas
Spring 2002
6.823
Vector Computers
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Supercomputers
Asanovic/Devadas
Spring 2002
6.823
Definition of a supercomputer:
Fastest machine in world at given task
Any machine costing $30M+
A device to turn a compute-bound problem into an
I/O bound problem
Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer
Supercomputer Applications
Asanovic/Devadas
Spring 2002
6.823
Typical application areas
Military research (nuclear weapons, cryptography)
Scientific research
Weather forecasting
Oil exploration
Industrial design (car crash simulation)
All involve huge computations on large data sets
In the 70s-80s, Supercomputer = Vector Machine
Vector Supercomputers
(Epitomized by Cray-1, 1976)
Scalar Unit + Vector Extensions
Load/Store Architecture
Vector Registers
Vector Instructions
Hardwired Control
Highly Pipelined Functional Units
Interleaved Memory System
No Data Caches
No Virtual Memory
Asanovic/Devadas
Spring 2002
6.823
Cray-1 (1976)
Asanovic/Devadas
Spring 2002
6.823
Asanovic/Devadas
Spring 2002
6.823
Cray-1 (1976)
[Figure: Cray-1 block diagram]
  Single-port memory: 16 banks of 64-bit words + 8-bit SECDED; memory bank cycle 50 ns
  80 MW/sec data load/store; 320 MW/sec instruction buffer refill
  8 vector registers V0-V7, 64 elements each, with vector mask (V.Mask) and vector length (V.Length) registers
  8 scalar registers S0-S7 backed by 64 T registers; 8 address registers A0-A7 backed by 64 B registers
  Functional units: FP add, FP multiply, FP reciprocal; integer add, logic, shift, population count; address add, address multiply
  4 instruction buffers (64-bit x 16 each) feeding NIP/CIP/LIP
  Processor cycle 12.5 ns (80 MHz)
Vector Programming Model
[Figure: vector programming model]
  Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] .. [VLRMAX-1]
  Vector Length Register VLR
  Vector arithmetic instructions, e.g. VADD v3, v1, v2: v3[i] = v1[i] + v2[i] for i = 0 .. VLR-1
  Vector load and store instructions, e.g. VLD v1, r1, r2: base address in r1, stride in r2; moves VLR elements between v1 and memory
Asanovic/Devadas
Spring 2002
6.823
Vector Code Example
# C code
for (i=0; i<64; i++)
C[i] = A[i] + B[i];
# Scalar Code
li r4, #64
loop:
ld f1, 0(r1)
ld f2, 0(r2)
fadd f3, f1, f2
st f3, 0(r3)
add r1, r1, #1
add r2, r2, #1
add r3, r3, #1
sub r4, #1
bnez r4, loop
# Vector Code
li vlr, #64
lv v1, r1, #1
lv v2, r2, #1
faddv v3, v1, v2
sv v3, r3, #1
Asanovic/Devadas
Spring 2002
6.823
Vector Instruction Set Advantages
Compact
one short instruction encodes N operations
Expressive, tells hardware that these N operations:
are independent
use the same functional unit
access disjoint registers
access registers in the same pattern as previous instructions
access a contiguous block of memory (unit-stride load/store)
access memory in a known pattern (strided load/store)
Scalable
can run same object code on more parallel pipelines or lanes
Asanovic/Devadas
Spring 2002
6.823
Vector Arithmetic Execution
Use deep pipeline (=> fast clock)
to execute element operations
Simplifies control of deep pipeline
because elements in vector are
independent (=> no hazards!)
[Figure: a six-stage multiply pipeline computing v3 <- v1 * v2; one element pair from v1 and v2 enters each cycle, so elements need no interlocks]
Asanovic/Devadas
Spring 2002
6.823
Vector Memory System
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
Bank busy time: Cycles between accesses to same bank
[Figure: an address generator takes a base and stride and issues element addresses round-robin across 16 memory banks (0-F) feeding the vector registers]
Asanovic/Devadas
Spring 2002
6.823
Vector Instruction Execution
VADD C,A,B
[Figure: execution using one pipelined functional unit: element pairs A[3]+B[3], A[4]+B[4], A[5]+B[5], A[6]+B[6] enter one per cycle, with C[0], C[1], C[2] completing below; versus execution using four pipelined functional units, where four pairs enter per cycle (A[12..15]+B[12..15], then A[16..19], A[20..23], A[24..27]) and results complete four at a time (C[0..3], C[4..7], C[8..11])]
Asanovic/Devadas
Spring 2002
6.823
Vector Unit Structure
[Figure: the vector register file and each functional unit are sliced into four lanes; lane 0 holds elements 0, 4, 8, ..., lane 1 holds 1, 5, 9, ..., lane 2 holds 2, 6, 10, ..., lane 3 holds 3, 7, 11, ...; all lanes connect to the memory subsystem]
Asanovic/Devadas
Spring 2002
6.823
T0 Vector Microprocessor (1995)
Vector register elements striped over lanes
[Figure: 8 lanes; lane 0 holds elements [0], [8], [16], [24]; lane 1 holds [1], [9], [17], [25]; ... lane 7 holds [7], [15], [23], [31]]
Vector Memory-Memory versus
Vector Register Machines
Asanovic/Devadas
Spring 2002
6.823
Vector memory-memory instructions hold all vector operands
in main memory
The first vector machines, CDC Star-100 ('73) and TI ASC ('71),
were memory-memory machines
Cray-1 ('76) was first vector register machine
Vector Memory-Memory Code
Example Source Code
for (i=0; i<N; i++)
{
C[i] = A[i] + B[i];
D[i] = A[i] - B[i];
}
VADD C, A, B
VSUB D, A, B
Vector Register Code
LV V1, A
LV V2, B
VADD V3, V1, V2
SV V3, C
VSUB V4, V1, V2
SV V4, D
Vector Memory-Memory vs.
Vector Register Machines
Asanovic/Devadas
Spring 2002
6.823
Vector memory-memory architectures (VMMA) require
greater main memory bandwidth, why?
All operands must be read in and out of memory
VMMAs make it difficult to overlap execution of
multiple vector operations, why?
Must check dependencies on memory addresses
VMMAs incur greater startup latency
Scalar code was faster on CDC Star-100 for vectors < 100 elements
For Cray-1, vector/scalar breakeven point was around 2 elements
Apart from CDC follow-ons (Cyber-205, ETA-10) all
major vector machines since Cray-1 have had vector
register architectures
(we ignore vector memory-memory from now on)
Automatic Code Vectorization
Asanovic/Devadas
Spring 2002
6.823
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
[Figure: scalar sequential code runs load, load, add, store for iteration 1, then load, load, add, store for iteration 2, one operation at a time; vectorized code issues one vector load, another vector load, a vector add, and a vector store, each vector instruction covering both iterations at once]
Vectorization is a massive compile-time
reordering of operation sequencing
requires extensive loop dependence
analysis
Vector Stripmining
Asanovic/Devadas
Spring 2002
6.823
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector
registers, Stripmining
for (i=0; i<N; i++)
  C[i] = A[i]+B[i];

[Figure: arrays A + B -> C, processed as one remainder-length strip followed by full 64-element strips]

      and  r1, N, 63     # N mod 64
      move vlr, r1       # Do remainder
loop: lv   v1, rA
      add  rA, rA, r1    # Bump pointer
      lv   v2, rB
      add  rB, rB, r1
      vadd v3, v1, v2
      sv   v3, rC
      add  rC, rC, r1
      sub  N, N, r1      # Subtract elements
      move vlr, 64       # Reset full length
      move r1, 64
      bgtz N, loop       # Any more to do?
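The same stripmining schema as a C sketch (the inner scalar loop stands in for one vector instruction executed with vlr = vl):

    /* First strip covers N mod 64 elements; all later strips are full. */
    void vadd_stripmined(double *C, const double *A, const double *B, int N) {
        int i = 0, vl = N % 64;
        if (vl == 0) vl = 64;              /* N a multiple of 64 */
        while (i < N) {
            for (int j = 0; j < vl; j++)   /* one "vadd" of length vl */
                C[i + j] = A[i + j] + B[i + j];
            i += vl;
            vl = 64;                       /* reset to full vector length */
        }
    }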
Vector Instruction Parallelism
Asanovic/Devadas
Spring 2002
6.823
Can overlap execution of multiple vector instructions
example machine has 32 elements per vector register and 8 lanes
[Figure: with 32 elements per vector register and 8 lanes, a vector load, a vector multiply, and a vector add each occupy their unit for 4 cycles; successive instructions issue one per cycle and overlap in the load, multiply, and add units]
Complete 24 operations/cycle while issuing 1 short instruction/cycle
Vector Chaining
Asanovic/Devadas
Spring 2002
6.823
Vector version of register bypassing
introduced with Cray-1
lv   v1
vmul v3, v1, v2
vadd v5, v3, v4

[Figure: memory feeds the load unit into v1; v1 chains into the multiplier (with v2) producing v3; v3 chains into the adder (with v4) producing v5]
Vector Chaining Advantage
Asanovic/Devadas
Spring 2002
6.823
Without chaining, must wait for last element of result to
be written before starting dependent instruction
Load
Mul
Time
Add
With chaining, can start dependent instruction as soon
as first result appears
Load
Mul
Add
Asanovic/Devadas
Spring 2002
6.823
Vector Startup
Two components of vector startup penalty
functional unit latency (time through pipeline)
dead time or recovery time (time before another vector
instruction can start down pipeline)
[Figure: the first vector instruction streams down the pipeline (R ... W per element); dead time must elapse before a second vector instruction may enter]
Asanovic/Devadas
Spring 2002
6.823
No Dead Time => Shorter Vectors
[Figure: pipeline timelines with no dead time vs. 4 cycles dead time per 64 cycles active]

Cray C90, two lanes: 4 cycle dead time; maximum efficiency 94% with 128 element vectors
T0, eight lanes: no dead time; 100% efficiency with 8 element vectors
Vector Scatter/Gather
Asanovic/Devadas
Spring 2002
6.823
Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
A[i] = B[i] + C[D[i]]
Indexed load instruction (Gather)
lv   vD, rD        # Load indices in D vector
lvx  vC, rC, vD    # Load indirect from rC base
lv   vB, rB        # Load B vector
vadd vA, vB, vC    # Do add
sv   vA, rA        # Store result
Vector Scatter/Gather
Asanovic/Devadas
Spring 2002
6.823
Scatter example:
for (i=0; i<N; i++)
A[B[i]]++;
Is following a correct translation?
lv   vB, rB        # Load indices in B vector
lvx  vA, rA, vB    # Gather initial A values
vadd vA, vA, 1     # Increment
svx  vA, rA, vB    # Scatter incremented values

(Only if the indices in B are distinct: a repeated index gathers the same initial value twice, so one of the increments is lost.)
Vector Conditional Execution
Asanovic/Devadas
Spring 2002
6.823
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
if (A[i]>0) then
A[i] = B[i];
else
A[i] = C[i];
Solution: Add vector mask (or flag) registers
vector version of predicate registers, 1 bit per element
and maskable vector instructions
vector operation becomes NOP at elements where mask bit is clear
Code example:
lv    vA, rA        # Load A vector
mgtz  m0, vA        # Set bits in mask register m0 where A>0
lv.m  vA, rB, m0    # Load B vector into A under mask
fnot  m1, m0        # Invert mask register
lv.m  vA, rC, m1    # Load C vector into A under mask
sv    vA, rA        # Store A back to memory (no mask)
Masked Vector Instructions
Simple Implementation
execute all N operations, turn off
result writeback according to mask
[Figure: all eight element operations A[i] op B[i] flow down the pipeline, and the mask bits (M[7]=1, M[6]=0, M[5]=1, M[4]=1, M[3]=0, M[2]=0, M[1]=1, M[0]=0) gate the write enable on the write data port, so C[i] is written only where M[i]=1]
Asanovic/Devadas
Spring 2002
6.823
Density-Time Implementation
scan mask vector and only execute
elements with non-zero masks
[Figure: the mask vector is scanned and only the elements with set mask bits (7, 5, 4, 1) enter the pipeline, so C[5], C[4], C[1] (and element 7) reach the write data port with no idle slots]
Compress/Expand Operations
Asanovic/Devadas
Spring 2002
6.823
Compress packs non-masked elements into
contiguous vector
population count of mask vector gives packed vector length
Expand performs inverse operation
[Figure: with mask bits M[7..0] = 1,0,1,1,0,0,1,0, Compress packs A[7], A[5], A[4], A[1] into the low elements of the result vector; Expand performs the inverse, scattering them back to element positions 7, 5, 4, 1 while positions 6, 3, 2, 0 keep their old values B[6], B[3], B[2], B[0]]
Used for density-time conditionals and also for general
selection operations
Vector Reductions
Asanovic/Devadas
Spring 2002
6.823
Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
sum += A[i]; # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary
tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0                  # Vector of VL partial sums
for (i=0; i<N; i+=VL)            # Stripmine VL-sized chunks
  sum[0:VL-1] += A[i:i+VL-1];    # Vector sum
# Now have VL partial sums in one vector register
do {
  VL = VL/2;                     # Halve vector length
  sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
} while (VL>1)
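A runnable C version of the same tree reduction (assuming, for brevity, that VL divides N; a real version needs a stripmine-style remainder):

    #define VL 64

    double vector_sum(const double *A, int N) {
        double sum[VL] = {0};                      /* VL partial sums   */
        for (int i = 0; i < N; i += VL)            /* stripmined chunks */
            for (int j = 0; j < VL; j++)           /* one vector add    */
                sum[j] += A[i + j];
        for (int vl = VL / 2; vl >= 1; vl /= 2)    /* binary-tree combine */
            for (int j = 0; j < vl; j++)
                sum[j] += sum[j + vl];
        return sum[0];
    }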
A Modern Vector Super: NEC SX-5 (1998)
Asanovic/Devadas
Spring 2002
6.823
CMOS Technology
250MHz clock (312 MHz in 2001)
CPU fits on one multi-chip module
SDRAM main memory (up to 128GB)
Scalar unit
4-way superscalar with out-of-order and speculative execution
64KB I-cache and 64KB data cache
Vector unit
8 foreground VRegs + 64 background VRegs (256 elements/VReg)
1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
16 lanes, 8 GFLOPS peak (32 FLOPS/cycle)
1 load & store unit (32x8 byte accesses/cycle)
64 GB/s memory bandwidth per processor
SMP structure
16 CPUs connected to memory through crossbar
1 TB/s shared memory bandwidth
Multimedia Extensions
Asanovic/Devadas
Spring 2002
6.823
Very short vectors added to existing ISAs for micros
Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
Newer designs have 128-bit registers (Altivec, SSE2)
Limited instruction set:
no vector length control
no strided load/store or scatter/gather
unit-stride loads must be aligned to 64/128-bit boundary
Limited vector register length:
requires superscalar dispatch to keep multiply/add/load units busy
loop unrolling to hide latencies increases register pressure
Trend towards fuller vector support in microprocessors
Symmetric Multiprocessors:
Synchronization and Sequential
Consistency
Page 1
Symmetric Multiprocessors
[Figure: two or more processors on a CPU-Memory bus with memory; a bridge connects an I/O bus carrying I/O controllers for graphics output, networks, and disks; the organization is symmetric]

All memory is equally far away from all processors
Any processor can do any I/O (set up a DMA transfer)
Page 2
Synchronization
The need for synchronization arises whenever there are parallel processes in a system (even in a uniprocessor system)

Forks and Joins: in parallel programming, a parallel process may want to wait until several events have occurred
Producer-Consumer: a consumer process must wait until the producer process has produced data
Exclusive use of a resource: the operating system has to ensure that only one process uses a resource at a given time

[Figure: fork of P1 and P2 followed by a join; a producer feeding a consumer]

Page 3
A Producer-Consumer Example
[Figure: producer appends at tail, consumer removes at head of a shared queue]

Consumer:
       Rhead <- M[head]
spin:  Rtail <- M[tail]        (3)
       if <Rhead> == <Rtail> goto spin
       R <- M[<Rhead>]         (4)
       Rhead <- <Rhead> + 1
       M[head] <- <Rhead>
       process(R)

Producer posting Item x:
       Rtail <- M[tail]
       M[<Rtail>] <- x         (1)
       Rtail <- <Rtail> + 1
       M[tail] <- <Rtail>      (2)

The program is written assuming instructions are executed in order.

Possible problems?
Suppose the tail pointer gets updated before the item x is stored?
Suppose R is loaded before x has been stored?

Page 4
A Producer-Consumer Example
Consumer:
       Rhead <- M[head]
spin:  Rtail <- M[tail]        (3)
       if <Rhead> == <Rtail> goto spin
       R <- M[<Rhead>]         (4)
       Rhead <- <Rhead> + 1
       M[head] <- <Rhead>
       process(R)

Producer posting Item x:
       Rtail <- M[tail]
       M[<Rtail>] <- x         (1)
       Rtail <- <Rtail> + 1
       M[tail] <- <Rtail>      (2)

Programmer assumes that if 3 happens after 2, then 4 happens after 1.
Problems are:
  Sequence 2, 3, 4, 1
  Sequence 4, 1, 2, 3

Page 5
Sequential Consistency: A Memory Model
A system is sequentially consistent if the result of
any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual processor
appear in the order specified by the program
Leslie Lamport
Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs
Page 6
Sequential Consistency
Concurrent sequential tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:                        T2:
  Store(X, 1)  (X = 1)       Load(R1, Y)
  Store(Y, 11) (Y = 11)      Store(B, R1) (B = Y)
                             Load(R2, X)
                             Store(A, R2) (A = X)

What are the legitimate answers for A and B?
(A, B) ∈ { (1, 11), (0, 10), (1, 10), (0, 11) } ?

(0, 11) is not legit.

Page 7
Sequential Consistency
Sequential consistency imposes additional memory
ordering constraints in addition to those imposed by
uniprocessor program dependencies
What are these in our example ?
Does (can) a system with caches, write buffers, or
out-of-order execution capability provide a
sequentially consistent view of the memory ?
More on this later
Page 8
Multiple Consumer Example
[Figure: producer appends at tail; Consumer1 and Consumer2 both remove at head]

Producer posting Item x:
       Rtail <- M[tail]
       M[<Rtail>] <- x
       Rtail <- <Rtail> + 1
       M[tail] <- <Rtail>

Consumer:
       Rhead <- M[head]
spin:  Rtail <- M[tail]
       if <Rhead> == <Rtail> goto spin
       R <- M[<Rhead>]
       Rhead <- <Rhead> + 1
       M[head] <- <Rhead>
       process(R)

What is wrong with this code?

Page 9
Multiple Consumer Example
[Figure: producer appends at tail; Consumer1 and Consumer2 both remove at head]

Consumer:
       Rhead <- M[head]
spin:  Rtail <- M[tail]
       if <Rhead> == <Rtail> goto spin
       R <- M[<Rhead>]          \  critical section: needs to be executed
       Rhead <- <Rhead> + 1      |  atomically by one consumer
       M[head] <- <Rhead>       /   => locks
       process(R)

Producer posting Item x:
       Rtail <- M[tail]
       M[<Rtail>] <- x
       Rtail <- <Rtail> + 1
       M[tail] <- <Rtail>

Page 10
Locks or Semaphores:
E. W. Dijkstra, 1965
A semaphore is a non-negative integer, with the
following operations:
P(s): if s > 0 decrement s by 1 otherwise wait
V(s): increment s by 1 and wake up one of
the waiting processes
Ps and Vs must be executed atomically, i.e., without
interruptions or
interleaved accesses to s by other processors
Process i:
  P(s)
  <critical section>
  V(s)

What does the initial value of s determine?
The maximum number of processes allowed in the critical section at once.

(A semaphore is also a visual system for sending information based on two flags held in each hand.)

Page 11
Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented
using ordinary Load and Store instructions in the
Sequential Consistency memory model. However,
protocols for mutual exclusion are difficult to design...
Simpler solution:
atomic read-modify-write instructions
Examples: (a is a memory address, R is a register)
Test&Set(a, R):
  R <- M[a];
  if <R> == 0 then M[a] <- 1;

Fetch&Add(a, RV, R):
  R <- M[a];
  M[a] <- <R> + <RV>;

Swap(a, R):
  Rt <- M[a];
  M[a] <- <R>;
  R <- <Rt>;

Page 12
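A C11 sketch of P and V over a binary semaphore, built on an atomic exchange (a Swap, which plays the same role as Test&Set here); this is an illustration, not the lecture's hardware-level protocol:

    #include <stdatomic.h>

    atomic_int s = 1;                  /* 1 = free, 0 = taken */

    void P(void) {
        while (atomic_exchange(&s, 0) == 0)
            ;                          /* busy-wait until we grab it */
    }

    void V(void) {
        atomic_store(&s, 1);           /* release; a waiter's exchange now succeeds */
    }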
Multiple Consumers Example:
using the Test & Set Instruction
P:     Test&Set(mutex, Rtemp)
       if (<Rtemp> != 0) goto P
       Rhead <- M[head]                 \
spin:  Rtail <- M[tail]                  |
       if <Rhead> == <Rtail> goto spin   |  critical section
       R <- M[<Rhead>]                   |
       Rhead <- <Rhead> + 1              |
       M[head] <- <Rhead>               /
V:     Store(mutex, 0)
       process(R)

Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement Ps and Vs

What is the problem with this code?
What if the process stops or is swapped out while in the critical section?

Page 13
Nonblocking Synchronization
Compare&Swap(a, Rt, Rs):    (implicit argument: status)
  if (<Rt> == M[a]) then
    M[a] <- <Rs>;
    Rt <- <Rs>;
    status <- success;
  else
    status <- fail;

try:   Rhead <- M[head]
spin:  Rtail <- M[tail]
       if <Rhead> == <Rtail> goto spin
       R <- M[<Rhead>]
       Rnewhead <- <Rhead> + 1
       Compare&Swap(head, Rhead, Rnewhead)
       if (status == fail) goto try
       process(R)

Page 14
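The same retry pattern in C11 (names hypothetical): compare-exchange reloads the expected value on failure, so the loop mirrors the "if (status == fail) goto try" above.

    #include <stdatomic.h>

    /* Nonblocking head-pointer bump: returns the index this caller won. */
    long claim_next(atomic_long *head) {
        long h = atomic_load(head);
        while (!atomic_compare_exchange_weak(head, &h, h + 1))
            ;                 /* on failure, h now holds the current value */
        return h;
    }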
Load-reserve & Store-conditional
Non-blocking Synchronization
Special register(s) to hold a reservation flag and address, and the outcome of store-conditional

Load-reserve(R, a):
  <flag, adr> <- <1, a>;
  R <- M[a];

Store-conditional(a, R):
  if <flag, adr> == <1, a>
  then cancel other procs' reservation on a;
       M[a] <- <R>;
       status <- succeed;
  else status <- fail;

try:   Load-reserve(Rhead, head)
spin:  Rtail <- M[tail]
       if <Rhead> == <Rtail> goto spin
       R <- M[<Rhead>]
       Rhead <- <Rhead> + 1
       Store-conditional(head, Rhead)
       if (status == fail) goto try
       process(R)

Page 15
Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and
c2. Initially, both c1 and c2 are 0 (not busy)
Process 1
...
c1 = 1;
L: if c2 == 1 then go to L
< critical section >
c1 = 0;
Process 2
...
c2 = 1;
L: if c1 == 1 then go to L
< critical section >
c2 = 0;
What is wrong?
Page 16
Mutual Exclusion: second attempt
To avoid deadlock, let process give up reservation
(i.e., Process 1 sets c1 to 0) while waiting.
Process 1
...
L: c1 = 1;
if c2 == 1 then
{ c1 = 0; goto L }
< critical section >
c1 = 0
Process 2
...
L: c2 = 1;
if c1 == 1 then
{ c2 = 0; goto L }
< critical section >
c2 = 0
Deadlock is not possible.
What could go wrong?
This is the most promising solution, but alas, we still have a problem with bounded waiting. Suppose Process j continually reenters its entry protocol after leaving its exit protocol, while Process i is waiting. It is possible that Process j will repeatedly reach the while test when Process i has temporarily cleared its flag. We cannot place a bound on how many times this could happen.
Page 17
A Protocol for Mutual Exclusion
T. Dekker, 1966
A protocol based on 3 shared variables c1, c2 and
turn. Initially, both c1 and c2 are 0 (not busy)
Process 1
...
c1 = 1;
turn = 1;
L: if c2 == 1 && turn == 1
then goto L
< critical section >
c1 = 0;
Process 2
...
c2 = 1;
turn = 2;
L: if c1 == 1 && turn == 2
then goto L
< critical section>
c2 = 0;
turn = i ensures that only process i can wait
variables c1 and c2 ensure mutual exclusion
"Take a number" approach used in bakeries. Never seen one in a bakery, but the RMV uses one.
Page 18
N-process Mutual Exclusion
Lamport's Bakery Algorithm
Process i
Initially num[j] = 0, for all j
Entry Code
choosing[i] = 1;
num[i] = max(num[0], ..., num[N-1]) + 1;
choosing[i] = 0;
for(j = 0; j < N; j++) {
while( choosing[j] );
while( num[j] &&
( ( num[j] < num[i] ) ||
( num[j] == num[i] && j < i ) ) );
}
Exit Code
num[i] = 0;
Wait if the process is currently choosing.
Wait if the process has a number and comes ahead of us.
Page 19
Implementation Issues
Implementation of SC is complicated by two issues

Out-of-order execution capability: which reorderings are safe?
  Load(a); Load(b)    yes
  Load(a); Store(b)   yes if a ≠ b
  Store(a); Load(b)   yes if a ≠ b
  Store(a); Store(b)  yes if a ≠ b

Caches
  Caches can prevent the effect of a store from being seen by other processors

Page 20
Memory Fences:
Instructions to serialize memory accesses
Processors with relaxed or weak memory models
(i.e., permit Loads and Stores to different addresses
to be reordered) need memory fence instructions to
force serialization of memory accesses
Processors with relaxed memory models:
Sparc V8 (TSO, PSO): Membar
PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations, however,
one pays for serialization only when it is required
Page 21
Using Memory Fences
[Figure: producer appends at tail, consumer removes at head]

Producer posting Item x:
       Rtail <- M[tail]
       M[<Rtail>] <- x
       membarSS             (ensures that the tail pointer is not
       Rtail <- <Rtail> + 1  updated before x has been stored)
       M[tail] <- <Rtail>

Consumer:
       Rhead <- M[head]
spin:  Rtail <- M[tail]
       if <Rhead> == <Rtail> goto spin
       membarLL             (ensures that R is not loaded
       R <- M[<Rhead>]       before x has been stored)
       Rhead <- <Rhead> + 1
       M[head] <- <Rhead>
       process(R)
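In C11 these two fences correspond to release/acquire orderings; a hedged single-producer, single-consumer sketch (buffer name and size are hypothetical):

    #include <stdatomic.h>

    long buf[1024];
    atomic_long tail_idx;

    void produce(long x) {
        long t = atomic_load_explicit(&tail_idx, memory_order_relaxed);
        buf[t] = x;
        /* release acts as the membarSS: x is visible before the tail bump */
        atomic_store_explicit(&tail_idx, t + 1, memory_order_release);
    }

    int try_consume(long head, long *out) {
        /* acquire acts as the membarLL: the buf read cannot be hoisted
           above the tail read */
        long t = atomic_load_explicit(&tail_idx, memory_order_acquire);
        if (head == t) return 0;       /* queue empty */
        *out = buf[head];
        return 1;
    }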
Page 22
Cache Coherence
Page 1
Memory Consistency in SMPs
[Figure: CPU-1 and CPU-2, each with a cache holding A = 100, share a CPU-Memory bus to a memory that also holds A = 100]

Suppose CPU-1 updates A to 200.
  write-back: memory and cache-2 have stale values
  write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?

Page 2
Write-back Caches & SC
prog T1:              prog T2:
  ST X, 1               LD Y, R1
  ST Y, 11              ST Y, R1
                        LD X, R2
                        ST X, R2

[Table: contents of cache-1, memory, and cache-2 after each event: T1 is executed (cache-1 holds X=1, Y=11; memory still X=0, Y=10); cache-1 writes back Y (memory now Y=11 but X=0); T2 is executed (cache-2 reads Y=11 from memory but stale X=0); cache-1 writes back X; cache-2 writes back X & Y. T2 sees the new Y but the old X, an outcome sequential consistency forbids.]

Page 3
Write-through Caches & SC
prog T1:              prog T2:
  ST X, 1               LD Y, R1
  ST Y, 11              ST Y, R1
                        LD X, R2
                        ST X, R2

[Table: contents of cache-1, memory, and cache-2 as T1 executes and then T2 executes; the write-through stores update memory immediately (X=1, Y=11), but cache-2 still holds its old copy X=0, so T2 can read the new Y and then the stale X]

Write-through caches don't preserve sequential consistency either

Page 4
Maintaining Sequential Consistency
SC sufficient for correct producer-consumer
and mutual exclusion code (e.g., Dekker)
Multiple copies of a location in various caches
can cause SC to break down.
Hardware support is required such that
only one processor at a time has write
permission for a location
no processor can load a stale copy of
the location after a write
cache coherence protocols
Page 5
A System with Multiple Caches
[Figure: processors P, each with an L1; pairs of L1s share an L2; the L2s connect through an interconnect to memory M]

Modern systems often have hierarchical caches
Each cache has exactly one parent but can have zero or more children
Only a parent and its children can communicate directly
Inclusion property is maintained between a parent and its children, i.e.,
  a ∈ Li  =>  a ∈ Li+1

Page 6
Cache Coherence Protocols for SC
write request:
the address is invalidated (updated) in all other
caches before (after) the write is performed
read request:
if a dirty copy is found in some cache, a write-back
is performed before the memory is read
We will focus on Invalidation protocols
as opposed to Update protocols
Update protocols, or write broadcast: latency between writing a word in one processor and reading it in another is usually smaller in a write-update scheme. But since bandwidth is more precious, most multiprocessors use a write-invalidate scheme.
Page 7
Warmup: Parallel I/O
[Figure: processor with cache on a memory bus (address A, data D, R/W) to physical memory; a DMA engine attached to a disk can also drive the bus. Either the cache or the DMA engine can be the bus master and effect transfers; page transfers occur while the processor runs]

DMA stands for Direct Memory Access

Page 8
Problems with Parallel I/O
[Figure: cached portions of a page sit in the processor's cache while DMA transfers move whole pages between physical memory and disk]

Memory -> Disk: physical memory may be stale if the cache is write-back (dirty lines may not have been written back)
Disk -> Memory: the cache may have stale data corresponding to the page DMA just overwrote

Page 9
Snoopy Cache [ Goodman 1983 ]
Idea: Have cache watch (or snoop upon) DMA
transfers, and then do the right thing
Snoopy cache tags are dual-ported:
  one port (A, D, R/W from the processor) drives the memory bus when the cache is bus master
  a snoopy read port (A, R/W) is attached to the memory bus to watch other masters' transfers
[Figure: cache organized as a tags-and-state array plus a data (lines) array between processor and bus]
A snoopy cache works in analogy to your snoopy next door neighbor, who is
always watching to see what you're doing, and interfering with your life. In
the case of the snoopy cache, the caches are all watching the bus for
transactions that affect blocks that are in the cache at the moment. The
analogy breaks down here; the snoopy cache only does something if your
actions actually affect it, while the snoopy neighbor is always interested in
what you're up to.
Page 10
Snoopy Cache Actions
Observed Bus Cycle            Cache State          Cache Action
Read cycle (Memory -> Disk)   Address not cached
                              Cached, unmodified
                              Cached, modified
Write cycle (Disk -> Memory)  Address not cached
                              Cached, unmodified
                              Cached, modified

Page 11
Shared Memory Multiprocessor
[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine attached to disks]

Use snoopy mechanism to keep all processors' views of memory coherent

Page 12
Cache State Transition Diagram
M: Modified Exclusive
E: Exclusive, unmodified
S: Shared
I: Invalid
Each cache line tag holds an address tag plus state bits.

[Figure: state transition diagram over M, E, S, I for cache M1. Edge labels: M1 write or read (stay in M); M1 write; M1 read; write miss; read miss; read by any processor; other processor read / write back line; M1 intent to write; other processor intent to write (which forces the line to I)]

Page 13
2 Processor Example
[Figure: two copies of the state diagram, one for block b in cache M1 and one for block b in cache M2, with the same events seen from each side: M1/M2 write or read, write miss, read miss, M1 read / write back line, M2 read / write back line, M1 intent to write, M2 intent to write. An action by one processor (e.g., M2 intent to write) invalidates the copy in the other cache.]

Page 14
Observation
[Figure: the M1 state diagram again, highlighting that M and E are entered only via transitions (write miss, M1 write, M1 intent to write) that first invalidate or write back every other copy]
If a line is in M or E, then no other cache has
a valid copy of the line!
Memory stays coherent, multiple differing copies
cannot exist
Page 15
Cache Coherence State Encoding
[Figure: block address = tag | index | offset; each cache line holds V and D state bits, an address tag, and the data block; a tag match plus V produces Hit? and selects the word]

Valid and dirty bits can be used to encode the S, I, and merged (E, M) states:
  V=0, D=x : Invalid
  V=1, D=0 : Shared (not dirty)
  V=1, D=1 : Exclusive (dirty)
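A one-function C sketch of this encoding:

    /* Decode the valid (V) and dirty (D) bits into the merged states. */
    typedef enum { INVALID, SHARED, EXCLUSIVE_OR_MODIFIED } line_state;

    line_state decode_state(int v, int d) {
        if (!v) return INVALID;                  /* V=0, D=x */
        return d ? EXCLUSIVE_OR_MODIFIED         /* V=1, D=1 */
                 : SHARED;                       /* V=1, D=0 */
    }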
What does it mean to merge E, M states?
Page 16
2-Level Caches
CPU
CPU
CPU
CPU
L1 $
L1 $
L1 $
L1 $
L2 $
L2 $
L2 $
L2 $
Snooper
Snooper
Snooper
Snooper
Processors often have two-level caches
Small L1 on chip, large L2 off chip
Inclusion property: entries in L1 must be in L2
invalidation in L2 invalidation in L1
Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
Interlocks are required when both CPU-L1 and L2-Bus interactions involve the same address.
Page 17
Intervention
CPU-1
A
CPU-2
cache-1
200
cache-2
CPU-Memory bus
A
100
Memory (stale)
When a read-miss for A occurs in cache-2,
a read request for A is placed on the bus
Cache-1 needs to supply & change its state to shared
The memory may respond to the request also!
does memory know it has stale data?
Cache-1 needs to intervene through memory
controller to supply correct data to cache-2
Page 18
False Sharing
state blk addr data0 data1
...
dataN
A cache block contains more than one word
Cache-coherence is done at the block-level and not
word-level
Suppose M1 writes wordi and M2 writes wordk and
both words have the same block address.
What can happen?
The block may be invalidated many times unnecessarily because the addresses share a common block.
Page 19
Synchronization and Caches:
Performance Issues
Processors 1, 2, and 3 each run:
   R <- 1
L: swap(mutex, R);
   if <R> then goto L;
   <critical section>
   M[mutex] <- 0;

[Figure: three processors, each with a cache (one currently holding mutex=1), on the CPU-Memory bus]
Cache-coherence protocols will cause mutex to ping-pong between P1's and P2's caches.
Ping-ponging can be reduced by first reading the mutex
location (non-atomically) and executing a swap only if
it is found to be zero.
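That optimization is the classic test-and-test-and-set lock; a C11 sketch:

    #include <stdatomic.h>

    atomic_int mutex;

    void acquire(void) {
        for (;;) {
            while (atomic_load(&mutex) != 0)
                ;                                /* spin on the cached copy */
            if (atomic_exchange(&mutex, 1) == 0)
                return;                          /* swap only when it looked free */
        }
    }

    void release(void) { atomic_store(&mutex, 0); }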
Page 20
Performance Related to Bus occupancy
In general, a read-modify-write instruction requires
two memory (bus) operations without intervening
memory operations by other processors
In a multiprocessor setting, bus needs to be locked
for the entire duration of the atomic read and write
operation
expensive for simple buses
very expensive for split-transaction buses
modern processors use
load-reserve
store-conditional
A split-transaction bus has a read-request transaction followed by a memory-reply transaction that contains the data. Split transactions make the bus available for other masters while the memory reads the words of the requested address. It also normally means that the CPU must arbitrate for the bus to request the data, and memory must arbitrate for the bus to return the data. Each transaction must be tagged. Split-transaction buses have higher bandwidth and higher latency.
Page 21
Load-reserve & Store-conditional
Special register(s) to hold reservation flag and
address, and the outcome of store-conditional
Load-reserve(R, a):
  <flag, adr> <- <1, a>;
  R <- M[a];

Store-conditional(a, R):
  if <flag, adr> == <1, a>
  then cancel other procs' reservation on a;
       M[a] <- <R>;
       status <- succeed;
  else status <- fail;
If the snooper sees a store transaction to the address
in the reserve register, the reserve bit is set to 0
Several processors may reserve a simultaneously
These instructions are like ordinary loads and stores
with respect to the bus traffic
Page 22
Performance:
Load-reserve & Store-conditional
The total number of memory (bus) transactions is
not necessarily reduced, but splitting an atomic
instruction into load-reserve & store-conditional:
increases bus utilization (and reduces
processor stall time), especially in split-transaction buses
reduces cache ping-pong effect because
processors trying to acquire a semaphore do
not have to perform a store each time
Page 23
Out-of-Order Loads/Stores & CC
[Figure: CPU with load/store buffers over a cache (line states I/S/E); a snooper watches the bus (Wb-req, Inv-req, Inv-rep), a pushout buffer issues write-backs (Wb-rep), and the cache exchanges S-req/E-req and S-rep/E-rep with memory]

Blocking caches: one request at a time + CC => SC
Non-blocking caches: multiple requests (to different addresses) concurrently + CC => relaxed memory models
CC ensures that all processors observe the same order of loads and stores to an address
Page 24
Relaxed Memory Models
Episode III in our multiprocessing miniseries: relaxed memory models. What I really wanted here was an elephant with sunglasses relaxing on a beach, but I couldn't find one.
Page 1
Sequential Consistency: A Memory Model
Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs
SC is easy to understand but architects and
compiler writers want to violate it for performance
Mark Hill wrote a paper which essentially says "why break your back for 20%?" Actually, people are out there breaking their backs for 1% in architecture these days.
Page 2
Optimizations & Memory Models
[Figure: CPU behind a processor-memory interface with load/store buffers and a load queue, a data cache, and pushout buffers in front of memory]
Architectural optimizations that are correct for
uniprocessors often result in a new memory
model for multiprocessors
This means that we are relaxing the ordering or relaxing atomicity.
Page 3
Relaxed Models
What orderings among reads and writes
performed by a single processor are
preserved by the model?
R->R, R->W, W->W, W->R
(a dependence exists if they are to the same address)
If there is a dependence, then program
semantics demand that operations be ordered
If there is no dependence, the memory
consistency model determines what orders
must be preserved
Relaxed model may allow an operation
executed later to complete first
Page 4
Memory Fences & Weak Memory Models
Processors with relaxed or weak memory models
need memory fence instructions to force serialization
of memory accesses
In SC there is an implicit fence at each memory
operation
Processors with relaxed memory models:
Sparc V8 (TSO, PSO): Membar
PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations, however,
one pays for serialization only when it is required
Page 5
Using Memory Fences
[Figure: producer appends at tail, consumer removes at head]

Producer posting Item x:
       Rtail <- M[tail]
       M[<Rtail>] <- x
       membarSS             (ensures that the tail pointer is not
       Rtail <- <Rtail> + 1  updated before x has been stored)
       M[tail] <- <Rtail>

Consumer:
       Rhead <- M[head]
spin:  Rtail <- M[tail]
       if <Rhead> == <Rtail> goto spin
       membarLL             (ensures that R is not loaded
       R <- M[<Rhead>]       before x has been stored)
       Rhead <- <Rhead> + 1
       M[head] <- <Rhead>
       process(R)

Page 6
Data-Race Free Programs
(a.k.a. Properly Synchronized Programs)
Process 1
...
Acquire(mutex);
< critical section >
Release(mutex);
Process 2
...
Acquire(mutex);
< critical section >
Release(mutex);
Synchronization variables (e.g., mutex) are separate
from data variables
Accesses to writable shared data variables are
protected in critical regions
no data races except for locks
(Formal definition is elusive)
In general, it cannot be proven whether a program is data-race free.
Nondeterminator.
Page 7
Fences in Data-Race Free Programs
Process 1
...
Acquire(mutex);
membar;
< critical section >
membar;
Release(mutex);
Process 2
...
Acquire(mutex);
membar;
< critical section >
membar;
Release(mutex);
Relaxed memory models allow reordering of
instructions by the compiler or the processor as long
as the reordering is not done across a fence
What about speculation and prefetching?
Processor should not speculate or prefetch across fences.
Page 8
Total Store Order (TSO)
IBM370, DECVAX
Eliminates the order W(a) -> R(b) when a ≠ b
Advantage?
[Figure: the code sequence =A ; B= ; acquire(S) ; C= ; =D ; release(S) ; E= ; F= with its enforced orderings under SC (every access ordered) and under TSO (later reads may bypass earlier writes)]
Allows the buffering of writes with bypassing by reads, which occurs
whenever the
processor allows a read to proceed before it guarantees that
an earlier write by the processor has been seen by all the other processors.
Page 9
TSO vs. SC
Initially x = old, y = old
Processor P1
Processor P2
x = new;
y_copy = y;
y = new;
x_copy = x;
Under SC what values can x_copy and y_copy get ?
Under TSO what values can x_copy and y_copy get ?
Under TSO both can get the old values.
Under SC at least one has to see the new value.
Page 10
Partial Store Ordering (PSO)
SPARC
Also eliminates the order W(a) -> W(b) when a ≠ b
Advantage?
[Figure: the same code sequence under TSO and under PSO; PSO additionally lets independent writes overlap]
Allows pipelining or overlapping of write operations, rather than forcing one operation to complete before another.
Page 11
Weak Ordering
POWERPC
Also eliminates the orders R(a) -> R(b) and R(a) -> W(b) when a ≠ b
Need non-blocking reads to exploit relaxation
[Figure: the same code sequence under PSO and under WO; WO additionally lets reads be reordered with later reads and writes]
Non-blocking reads; doesn't help too much.
Page 12
Release Consistency
Alpha, MIPS
Read/write that precedes acquire need not
complete before acquire, and read/write that
follows a release need not wait for the release
[Figure: the same code sequence under SC and under RC; RC lets ordinary accesses move across the acquire and release as described above]
Weakest of the memory models used these days.
Page 13
Release Consistency Example
Initially data = old
Processor P1
Processor P2
data = new;
while(flag != SET) { }
flag = SET;
data_copy = data;
How do we ensure that data_copy is always set
to new ?
A Membar Store-Store in P1 (between the data and flag stores),
and a Membar Load-Load in P2 (between the flag spin and the data read).
Page 14
Weaker Memory Models
[Figure: spectrum of weaker memory models on Alpha, Sparc, PowerPC, ...: from "store is globally performed" down through write buffers (TSO, PSO, RMO, ...) to SMP and DSM systems; is RMO = WO?]

Hard to understand and remember
Unstable: every year brings a new "model of the year"

Weak ordering. Interaction with cache coherence. Weak atomicity.
Page 15
Why SC is not the right model for compilers
Intermediate representation is intrinsically a partial
order (data-flow graph)
expose scope for instruction reordering
to the underlying architecture
Load/Store atomicity forces compilers to overspecify requirements for completion of operations
expose cache coherence actions
Page 16
The CRF Model
X. Shen, Arvind, L. Rudolph (1999)
[Figure: processors, each with a semantic cache (sache), above a shared memory]
Exposes
data caching via semantic caches
Store(a,v)  =  StoreL(a,v); Commit(a)
Load(a)     =  Reconcile(a); LoadL(a)
instruction reordering (controllable via Fence)
Page 17
CRF: Load Local & Store Local
[Figure: each processor's sache holds cells; LoadL(a) returns v from Cell(a,v,-) if cached, while StoreL(a,v) writes Cell(a,v,D)]
LoadL reads from the sache if the address is cached
StoreL writes into the sache and sets the state to Dirty
Page 18
CRF: Commit and Reconcile
[Figure: Commit(a) completes once the sache holds no Cell(a,-,D); Reconcile(a) completes once the sache holds no Cell(a,-,C)]
Commit completes if the address is not cached in the
Dirty state
Reconcile completes if the address is not cached in the Clean state
Page 19
CRF: Background Operations
[Figure: background sache operations against shared memory: Cache retrieves a copy of an address, e.g. Cell(a,5) arriving in state C; Writeback copies a dirty Cell(b,8,D) to memory, leaving it Clean; Purge drops a clean Cell(c,7,C)]
Cache (retrieve) a copy of an uncached address
from memory
Writeback a Dirty copy to memory and set its
state Clean
Purge a Clean copy
Page 20
CRF: Fences
Instructions can be reordered except for
  data dependences
  StoreL(a,v); Commit(a)
  Reconcile(a); LoadL(a)

Example:
  Reconcile(a1); LoadL(a1);
  Fencerr(a1, a2);
  Reconcile(a2); LoadL(a2);

Also Fencewr; Fencerw; Fenceww;
Page 21
Producer-Consumer Synchronization
[Figure: the writer executes StoreL(a,v); Commit(a), then a writeback moves the value from its sache to memory; the reader executes Reconcile(a); LoadL(a), re-fetching the value from memory]
Break down the synchronization equally between
the producer and consumer
Semantically, memory behaves as the rendezvous
between processors
no operation involves more than one sache
Page 22
CRF: A Universal Memory Model
SC Program
[Figure: an SC program (Load, Store) goes through a translation scheme to a CRF program (LoadL, StoreL, Commit, Reconcile), which runs on a CRF protocol]
A CRF protocol is automatically a protocol for
any memory model whose programs can be
translated into CRF programs
Page 23
Translating SC into CRF
Initially a = 0, flag = 0
Processor 1
Processor 2
Store(a,10);
Store(flag,1);
L: r1 = Load(flag);
Jz(r1,L);
r2 = Load(a);
Page 24
Translating SC into CRF
Processor 1
Store(a,10);
Fenceww(a, flag);
Store(flag,1);
Processor 2
L: r1 = Load(flag);
Jz(r1,L);
Fencerr(flag, a);
r2 = Load(a);
Weak ordering
Page 25
Translating SC into CRF
Processor 1
Processor 2
StoreL(a,10);
Commit(a);
Fenceww(a, flag);
StoreL(flag,1);
Commit(flag);
L: Reconcile(flag);
r1 = LoadL(flag);
Jz(r1,L);
Fencerr(flag, a);
Reconcile(a);
r2 = LoadL(a);
Page 26
Asanovic/Devadas
Spring 2002
6.823
Microprocessor Evolution:
4004 to Pentium Pro
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
First Microprocessor
Intel 4004, 1971
Asanovic/Devadas
Spring 2002
6.823
4-bit accumulator
architecture
8m pMOS
2,300 transistors
3 x 4 mm2
750kHz clock
8-16 cycles/inst.
Asanovic/Devadas
Spring 2002
6.823
Microprocessors in the Seventies
Initial target was embedded control
First micro, 4-bit 4004 from Intel, designed for a desktop
printing calculator
Constrained by what could fit on single chip
Single accumulator architectures
8-bit micros used in hobbyist personal computers
Micral, Altair, TRS-80, Apple-II
Little impact on conventional computer market until
VISICALC spreadsheet for Apple-II (6502, 1MHz)
First killer business application for personal computers
Asanovic/Devadas
Spring 2002
6.823
DRAM in the Seventies
Dramatic progress in MOSFET memory technology
1970, Intel introduces first DRAM (1Kbit 1103)
1979, Fujitsu introduces 64Kbit DRAM
=> By mid-Seventies, obvious that PCs would soon
have > 64KBytes physical memory
Microprocessor Evolution
Asanovic/Devadas
Spring 2002
6.823
Rapid progress in size and speed through 70s
Fueled by advances in MOSFET technology and expanding markets
Intel i432
Most ambitious seventies micro; started in 1975 - released 1981
32-bit capability-based object-oriented architecture
Instructions a variable number of bits long
Severe performance, complexity, and usability problems
Intel 8086 (1978, 8MHz, 29,000 transistors)
Stopgap 16-bit processor, architected in 10 weeks
Extended accumulator architecture, assembly-compatible with 8080
20-bit addressing through segmented addressing scheme
Motorola 68000 (1979, 8MHz, 68,000 transistors)
Heavily microcoded (and nanocoded)
32-bit general purpose register architecture (24 address pins)
8 address registers, 8 data registers
Intel 8086
Asanovic/Devadas
Spring 2002
6.823
Class      Register   Purpose
Data:      AX, BX     general purpose
           CX         string and loop ops only
           DX         mult/div and I/O only
Address:   SP         stack pointer
           BP         base pointer (can also use BX)
           SI, DI     index registers
Segment:   CS         code segment
           SS         stack segment
           DS         data segment
           ES         extra segment
Control:   IP         instruction pointer (lower 16 bits of PC)
           FLAGS      C, Z, N, B, P, V and 3 control bits

Typical format: R <- R op M[X], many addressing modes
Not a GPR organization!
IBM PC, 1981
Asanovic/Devadas
Spring 2002
6.823
Hardware
Team from IBM building PC prototypes in 1979
Motorola 68000 chosen initially, but 68000 was late
IBM builds stopgap prototypes using 8088 boards from
Display Writer word processor
8088 is 8-bit bus version of 8086 => allows cheaper system
Estimated sales of 250,000
100,000,000s sold
Software
Microsoft negotiates to provide OS for IBM. Later buys and
modifies QDOS from Seattle Computer Products.
Open System
Standard processor, Intel 8088
Standard interfaces
Standard OS, MS-DOS
IBM permits cloning and third-party software
The Eighties:
Microprocessor Revolution
Asanovic/Devadas
Spring 2002
6.823
Personal computer market emerges
Huge business and consumer market for spreadsheets, word
processing and games
Based on inexpensive 8-bit and 16-bit micros: Zilog Z80, Mostek
6502, Intel 8088/86,
Minicomputers replaced by workstations
Distributed network computing and high-performance graphics
for scientific and engineering applications (Sun, Apollo, HP,)
Based on powerful 32-bit microprocessors with virtual memory,
caches, pipelined execution, hardware floating-point
Massively Parallel Processors (MPPs) appear
Use many cheap micros to approach supercomputer
performance (Sequent, Intel, Parsytec)
The Nineties
Asanovic/Devadas
Spring 2002
6.823
Distinction between workstation and PC disappears
Parallel microprocessor-based SMPs take over lowend server and supercomputer market
MPPs have limited success in supercomputing market
High-end mainframes and vector supercomputers
survive killer micro onslaught
64-bit addressing becomes essential at high-end
In 2001, 4GB DRAM costs <$5,000
CISC ISA (x86) thrives!
Reduced ISA Diversity in Nineties
Few major companies in general-purpose market
Intel x86 (CISC)
IBM 390 (CISC)
Sun SPARC, SGI MIPS, HP PA-RISC (all RISCs)
IBM/Apple/Motorola introduce PowerPC (another RISC)
Digital introduces Alpha (another RISC)
Software costs make ISA change prohibitively expensive
64-bit addressing extensions added to RISC instruction sets
Short vector multimedia extensions added to all ISAs, but without compiler
support
=> Focus on microarchitecture (superscalar, out-of-order)
CISC x86 thrives!
RISCs (SPARC, MIPS, Alpha, PowerPC) fail to make significant inroads into
desktop market, but important in server and technical computing markets
RISC advantage shrinks with superscalar out-of-order
execution
Intel Pentium Pro, (1995)
During decode, translate complex x86
instructions into RISC-like micro-operations
(uops)
e.g., R <- R op Mem translates into
    load T, Mem    # Load from Mem into temp reg
    R <- R op T    # Operate using value in temp
Execute uops using speculative out-of-order
superscalar engine with register renaming
Pentium Pro family architecture (P6 family) used
on Pentium-II and Pentium-III processors
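A minimal C sketch of this cracking step (the uop structure, field names, and encoding here are hypothetical illustrations, not Intel's actual format):

    /* Crack a reg-memory x86 add into two RISC-like uops. */
    typedef struct { int opcode; int src1, src2, dst; } Uop;
    enum { UOP_LOAD, UOP_ADD };

    /* addl (%esp), %eax  =>  load T, Mem[esp]; eax <- eax + T */
    int crack_add_reg_mem(Uop out[2], int reg, int addr_reg, int temp) {
        out[0] = (Uop){ UOP_LOAD, addr_reg, 0, temp };  /* load T <- Mem  */
        out[1] = (Uop){ UOP_ADD, reg, temp, reg };      /* reg <- reg op T */
        return 2;   /* number of uops produced */
    }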
Intel Pentium Pro (1995)
[Block diagram: the bus interface unit connects the external bus, L2
cache, and memory reorder buffer to the instruction cache and fetch
unit, which is steered by the branch target buffer; the instruction
decoder, with its microinstruction sequencer, turns x86 CISC macro
instructions into internal RISC-like micro-ops; these flow through
the register alias table into the reservation station, which feeds
the address generation unit, integer unit, floating-point unit, and
memory interface unit; the data cache and the reorder buffer and
retirement register file complete the engine.]
P6 Instruction Fetch & Decode
8KB I-cache, 4-way s.a., 32-byte lines, virtual index/physical tag
I-TLB: 32+4 entries, fully assoc. (32 entries for 4KB pages plus 4
entries for 4MB pages)
16-byte aligned fetch of 16 bytes, PC supplied by the branch predictor
Fetch buffer (holds x86 insts.)
Three decoders: two simple decoders (1 uop each) plus one complex
decoder (1-4 uops); a uop sequencer (microcode) handles longer
sequences
uop buffer (6 entries)
P6 uops
Each uop has fixed format of around 118 bits
opcode, two sources, and destination
sources and destination fields are 32 bits wide to hold an
immediate or operand
Simple decoders can only handle simple x86
instructions that map to one uop
Complex decoder can handle x86 translations of
up to 4 uops
Complicated x86 instructions handled by
microcode engine that generates uop sequence
Intel data shows average of 1.2-1.7 uops per x86
instruction on SPEC95 benchmarks, 1.4-2.0 on MS
Office applications
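As a rough illustration of how such a fixed format might be packed (the field split below is an assumption for illustration; the real encoding is not public):

    #include <stdint.h>

    /* Hypothetical ~118-bit fixed-format uop: opcode plus two sources
       and a destination, each 32 bits wide, as described above. */
    typedef struct {
        uint32_t src1;        /* 32 bits: register id or immediate */
        uint32_t src2;        /* 32 bits: register id or immediate */
        uint32_t dst;         /* 32 bits: destination */
        uint32_t opcode : 22; /* remaining bits: opcode + flags (assumed) */
    } Uop118;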
P6 Reorder Buffer and Renaming
uop buffer (6 entries) supplies 3 uops/cycle; ROB, RAT, and RS
entries are allocated for each uop
Reorder Buffer (ROB): 40 entries, each holding data and status
Register Alias Table (RAT): maps EAX, EBX, ECX, EDX, ESI, EDI, ESP,
and EBP to ROB entries or the Retirement Register File (RRF)
Values move from ROB to architectural register file
(RRF) when committed
P6 Reservation Stations and
Execution Units
The ROB (40 entries) sends renamed uops (3/cycle) to the Reservation
Station (20 entries), which dispatches up to 5 uops/cycle to the
execution units: store data, store address, load address, two integer
ALUs, and the FP ALU
Memory Reorder Buffer (MOB): 1 load and 1 store per cycle; stores
only leave the MOB when their uop commits
D-TLB: 64 entries for 4KB pages, fully assoc., plus 8 entries for 4MB
pages, 4-way s.a.
8KB D-cache, 4-way s.a., 32-byte lines, divided into 4 interleaved
banks
P6 Retirement
After a uop writes back to the ROB with no outstanding
exceptions or mispredicts, it becomes eligible for
retirement
Data written to RRF from ROB
ROB entry freed, RAT updated
uops retired in order, up to 3 per cycle
Have to check and report exceptions at valid x86
instruction fault points
complex instructions (e.g., string move) may generate
thousands of uops
P6 Pipeline
[Pipeline diagram: BTB access and I-cache access, fetch buffer,
decode, rename, RS write, then RS read, execute, and retire (stages
10-12, through the ROB). The load pipeline adds address calculation,
D-cache access, and an L2 access that is bypassed on an L1 hit,
retiring through the MOB and ROB.]
P6 Branch Penalties
[Pipeline diagram: the branch mispredict penalty spans the whole
in-order front end, since a mispredicted branch is only resolved at
execute and fetch must restart from BTB/I-cache access.]
P6 Branch Target Buffer (BTB)
512 entries, 4-way set-associative
Holds branch target, plus two-level BHT for
taken/not-taken
Unconditional jumps not held in BTB
One cycle bubble on correctly predicted taken branches (no penalty
if correctly predicted not-taken)
Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~90-95% correct)
[Diagram: the fetch PC and a 2-bit global branch history shift
register, which shifts in the taken/not-taken result of each branch,
together select one of the four sets of BHT bits.]
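A small C sketch of this two-level scheme (table size and indexing are illustrative assumptions, not the exact P6 hardware):

    #include <stdint.h>

    #define BHT_ENTRIES 512
    static uint8_t bht[BHT_ENTRIES][4]; /* four 2-bit counters per entry */
    static uint8_t ghist;               /* 2-bit global history */

    int predict_taken(uint32_t pc) {
        return bht[pc % BHT_ENTRIES][ghist & 3] >= 2; /* MSB set => taken */
    }

    void update(uint32_t pc, int taken) {
        uint8_t *c = &bht[pc % BHT_ENTRIES][ghist & 3];
        if (taken && *c < 3) (*c)++;              /* saturating counter */
        if (!taken && *c > 0) (*c)--;
        ghist = ((ghist << 1) | (taken & 1)) & 3; /* shift in outcome */
    }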
P6 Static Branch Prediction
If a branch misses in the BTB, then static prediction is
performed
Backwards branch predicted taken, forwards
branch predicted not-taken
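In C, the static rule amounts to a comparison of the target against the PC (a sketch, assuming a simple flat code address space):

    /* Backward branch (target below the PC) predicted taken;
       forward branch predicted not-taken. */
    int static_predict_taken(unsigned pc, unsigned target) {
        return target < pc;
    }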
P6 Branch Penalties
[Pipeline diagram: a branch that hits in the BTB and is predicted
taken pays a small fetch-bubble penalty; a branch that misses in the
BTB is predicted at decode (backwards taken, forwards not-taken) for
a larger penalty; a mispredicted branch is resolved only at execute,
for the full pipeline penalty.]
P6 System
[System diagram: up to four CPUs, each with L1 I$ and D$ and a
backside bus to a private L2 cache, share a split-transaction
frontside bus to the memory controller, which connects to DRAM, an
AGP graphics card over the AGP bus, and the PCI bus. Glueless SMP up
to 4 processors.]
Pentium-III Die Photo
[Die photo with labeled blocks: programmable interrupt control;
external and backside bus logic; packed FP datapaths; integer
datapaths; page miss handler; floating-point datapaths; memory order
buffer; memory interface unit (converts floats to/from memory
format); clock; 16KB 4-way s.a. D$; MMX datapaths; register alias
table; allocate entries (ROB, MOB, RS); reservation station; branch
address calc; reorder buffer (40-entry physical regfile +
architectural regfile); 256KB 8-way s.a. L2; instruction fetch unit
(16KB 4-way s.a. I-cache); instruction decoders (3 x86 insts/cycle);
microinstruction sequencer.]
Pentium Pro vs MIPS R10000
[Die photo comparison with labeled blocks: external bus interface,
execution units, D-cache, I-cache.]
Estimates of a 30% hit for CISC versus RISC; compare with the
original RISC advantage of 2.6x
The RISC advantage decreased because the size of the out-of-order
core is largely independent of the original ISA
Advanced CISC Implementations:
Pentium 4
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Intel Pentium Pro (1995)
During decode, translate complex x86
instructions into RISC-like micro-operations
(uops)
e.g., R <- R op Mem translates into
    load T, Mem    # Load from Mem into temp reg
    R <- R op T    # Operate using value in temp
Execute uops using speculative out-of-order
superscalar engine with register renaming
Pentium Pro family architecture (P6 family) used
on Pentium-II and Pentium-III processors
Intel Pentium 4 (2000)
Deeper pipelines than P6 family
about half as many levels of logic per pipeline stage as P6
Trace cache holds decoded uops
only has a single x86->uop decoder
Decreased latency in same process technology
aggressive circuit design
new microarchitectural tricks
This lecture contains figures and data taken from: The microarchitecture
of the Pentium 4 processor, Intel Technology Journal, Q1, 2001
Pentium 4 Block Diagram
P-III vs. P-4 Pipelines
In same process technology, ~1.5x clock frequency
Performance Equation:
Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
Apple Marketing for G4
Shorter data pipeline
The performance advantage of the PowerPC G4
starts with its data pipeline. The term processor
pipeline refers to the number of processing steps,
or stages, it takes to accomplish a task. The fewer
the steps, the shorter and more efficient the
pipeline. Thanks to its efficient 7-stage design
(versus 20 stages for the Pentium 4 processor)
the G4 processor can accomplish a task with 13
fewer steps than the PC. You do the math.
Relative Frequency of Intel Designs
Over time, fewer logic levels per pipeline stage and
more advanced circuit design
Higher frequency in same process technology
Deep Pipeline Design
Greater potential throughput but:
Clock uncertainty and latch delays eat into cycle time
budget
doubling pipeline depth gives less than twice frequency
improvement
Clock load and power increases
more latches running at higher frequencies
More complicated microarchitecture needed to cover
long branch mispredict penalties and cache miss
penalties
from Little's Law, need more instructions in flight to cover longer
latencies, hence larger reorder buffers (see the worked example
after this list)
P-4 has three major clock domains
Double-pumped ALU (3 GHz), small critical area at highest speed
Main CPU pipeline (1.5 GHz)
Trace cache (0.75 GHz), save power
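As a rough worked example of the Little's Law point above (the numbers are illustrative, not from the slides): at 3 uops issued per cycle and a 20-cycle latency to resolve a branch, roughly 3 x 20 = 60 uops must be in flight to keep the pipeline busy, which is why deeper pipelines need larger reorder buffers and schedulers.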
Pentium 4 Trace Cache
Holds decoded uops in predicted program flow
order, 6 uops per line
Code in memory:
        cmp
        br T1
        ...
    T1: sub
        br T2
        ...
    T2: mov
        sub
        br T3
        ...
    T3: add
        sub
        mov
        br T4
        ...
    T4: ...
Code packed in trace cache (6 uops/line), following the predicted
taken path:
    line 1: cmp, br T1, sub, br T2, mov, sub
    line 2: br T3, add, sub, mov, br T4, T4: ...
Trace cache fetches one 6-uop line every 2 CPU clock cycles (it runs
at half the main CPU clock rate)
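A minimal C sketch of line filling (the helper below is a stand-in for fetch, decode, and branch prediction, stubbed out here):

    #define LINE_UOPS 6
    typedef struct { int opcode, src1, src2, dst; } Uop;

    static Uop next_uop_on_predicted_path(void) { /* stub: would follow */
        return (Uop){0, 0, 0, 0};                 /* predicted branches */
    }

    static void fill_trace_line(Uop line[LINE_UOPS]) {
        for (int i = 0; i < LINE_UOPS; i++)  /* pack 6 uops per line,
                                                crossing taken branches */
            line[i] = next_uop_on_predicted_path();
    }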
Trace Cache Advantages
Removes x86 decode from branch mispredict
penalty
Parallel x86 decoder took 2.5 cycles in P6, would be 5 cycles in P-4
design
Allows higher fetch bandwidth for correctly predicted taken branches
P6 had one cycle bubble for correctly predicted taken branches
P-4 can fetch a branch and its target in same cycle
Saves energy
x86 decoder only powered up on trace cache refill
Pentium 4 Front End
[Front-end diagram: the L2 cache supplies x86 instructions (8
bytes/cycle) to instruction prefetch & TLB and a fetch buffer,
steered by the front-end BTB (4K entries); a single x86 decoder
handles one x86 instruction per cycle, producing up to 4 uops/cycle
into the trace cache fill buffer (6 uops/line), which fills the trace
cache (12K uops).]
Translation from x86 instructions to internal uops only happens on a
trace cache miss, one x86 instruction per cycle. Translations are
cached in the trace cache.
P-4 Trace Cache Fetch
[P-4 pipeline stages: 1-2 TC Next IP (BTB), 3-4 TC Fetch, 5 Drive,
6 Alloc, 7-8 Rename, 9 Queue, 10-12 Schedule, 13-14 Dispatch, 15-16
Register File, 17 Execute, 18 Flags, 19 Branch Check, 20 Drive.]
Trace cache (12K uops, 2K lines of 6 uops), indexed by the trace IP
Trace BTB (512 entries) plus a 16-entry subroutine return address
stack
Microcode ROM supplies uop sequences for complex instructions
6 uops every two CPU cycles into the uop buffer; 3 uops/cycle out
P-III vs. P-4 Renaming
[P-4 pipeline stage diagram as above, highlighting the Alloc and
Rename stages.]
P-4 physical register file separated from ROB status.
ROB entries allocated sequentially as in P6 family.
One of 128 physical registers allocated from free list.
No data movement on retire, only Retirement RAT
updated.
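A C sketch of that retire step (structures and sizes are illustrative; only the 128 physical registers come from the slide):

    #define NPHYS 128
    typedef struct { int dst_arch, dst_phys, old_phys; } RobEntry;

    int retirement_rat[16];   /* arch reg -> committed phys reg */
    int free_list[NPHYS];
    int n_free;

    void retire(const RobEntry *e) {
        retirement_rat[e->dst_arch] = e->dst_phys; /* repoint, no data move */
        free_list[n_free++] = e->old_phys;         /* recycle old mapping */
    }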
P-4 Op Queues and Schedulers
[P-4 pipeline stage diagram as above, highlighting the Queue and
Schedule stages.]
Allocated/renamed uops arrive at 3 uops/cycle into two uop queues,
each in-order within itself: a memory uop queue and an arithmetic
uop queue
Schedulers drain the queues: a memory scheduler, two fast schedulers,
a general scheduler, and a simple FP scheduler
Ready uops compete for dispatch ports (the fast schedulers can each
dispatch 2 ALU operations per cycle)
P-4 Execution Ports
Schedulers compete for access to execution ports
Loads and stores have dedicated ports
ALUs can execute two operations per cycle
Peak bandwidth of 6 uops per cycle
load, store, plus four double-pumped ALU operations
P-4 Fast ALUs and Bypass Path
[Datapath diagram: the register file and bypass network feed the
fast ALUs and the L1 data cache.]
Fast ALUs and the bypass network run at double speed
All non-essential circuit paths are handled outside the fast loop to
reduce circuit loading (shifts, mult/div, branches, flag ops)
Other bypassing takes multiple clock cycles
P-4 Staggered ALU Design
Staggers the 32-bit add and flag compare into three cycle phases:
low 16 bits
high 16 bits
flag checks
Bypasses 16 bits around every cycle
back-to-back dependent 32-bit adds at 3GHz in 0.18 μm
L1 Data Cache access starts with the bottom 16 bits as index; the
top 16 bits are used for the tag check later
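A C sketch of the staggered add itself (a functional view only; in hardware the three phases overlap with neighboring operations):

    #include <stdint.h>

    uint32_t staggered_add(uint32_t a, uint32_t b, int *zero_flag) {
        uint32_t lo = (a & 0xFFFF) + (b & 0xFFFF);        /* phase 1 */
        uint32_t hi = (a >> 16) + (b >> 16) + (lo >> 16); /* phase 2 + carry */
        uint32_t sum = (hi << 16) | (lo & 0xFFFF);
        *zero_flag = (sum == 0);                          /* phase 3: flags */
        return sum;
    }

Because only 16 bits are bypassed each half-cycle, a dependent add can start on its low half as soon as the producer's low half is done.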
P-4 Load Schedule Speculation
[P-4 pipeline stage diagram as above, with stages 17-18 as Load
Execute 1 and 2: there is a long delay from the schedulers to the
load hit/miss result.]
P-4 guesses that load will hit in L1 and
schedules dependent operations to use
value
If load misses, only dependent
operations are replayed
P-4 Branch Penalty
[P-4 pipeline stage diagram as above: a mispredict is discovered at
Branch Check, stage 19.]
20 cycle branch
mispredict penalty
P-4 uses new trade secret branch
prediction algorithm
Intel claims 1/3 fewer mispredicts than
P6 algorithm
P-4 Microarchitecture
[Figure: P-4 microarchitecture; the Pentium-III die photo above is
repeated for comparison.]
Scaling of Wire Delay
Over time, transistors are getting relatively faster
than long wires
wire resistance growing dramatically with shrinking width and
height
capacitance roughly fixed for constant length wire
RC delays of fixed length wire rising
Chips are getting bigger
P-4 >2x size of P-III
Clock frequency rising faster than transistor speed
deeper pipelines, fewer logic gates per cycle
more advanced circuit designs (each gate goes faster)
Takes multiple cycles for signal to cross chip
Visible Wire Delay in P-4 Design
[P-4 pipeline stage diagram as above, highlighting the two Drive
stages (5 and 20).]
Pipeline stages dedicated to just
driving signals across chip!
Instruction Set Translation
Convert a target ISA into a host machine's ISA
Pentium Pro (P6 family)
translation in hardware after instruction fetch
Pentium-4 family
translation in hardware at level 1 instruction cache refill
Transmeta Crusoe
translation in software using Code Morphing
Virtual Machines and Dynamic
Translation:
Implementing ISAs in Software
Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology
Software Applications
How is a software application encoded?
What are you getting when you buy a software application?
What machines will it work on?
Who do you blame if it doesn't work, i.e., what contract(s) were
violated?
ISA + Environment =
Virtual Machine
ISA alone not sufficient to write useful programs, need I/O
Direct access to memory mapped I/O via load/store
instructions problematic
time-shared systems
portability
Operating system responsible for I/O
sharing devices and managing security
hiding different types of hardware (e.g., EIDE vs. SCSI disks)
ISA communicates with operating system through
some standard mechanism, i.e., syscall instructions
example convention to open file:
    addi r1, r0, 27       # 27 is code for file open
    addu r2, r0, rfname   # r2 points to filename string
    syscall               # cause trap into OS
    # On return from syscall, r1 holds file descriptor
Application Binary Interface
(ABI)
Programs are usually distributed in a binary format that encodes
the program text (instructions) and initial values of some data
segments
Virtual machine specifications include
what state is available at process creation
which instructions are available (the ISA)
what system calls are possible (I/O, or the environment)
The ABI is a specification of the binary format used
to encode programs for a virtual machine
Operating system implements the virtual machine
at process startup, OS reads the binary program, creates an
environment for it, then begins to execute the code, handling traps
for I/O calls, emulation, etc.
OS Can Support Multiple VMs
Virtual machine features change over time with
new versions of operating system
new ISA instructions added
new types of I/O are added (e.g., asynchronous file I/O)
Common to provide backwards compatibility so
old binaries run on new OS
SunOS 5 (System V Release 4 Unix, Solaris) can run binaries
compiled for SunOS4 (BSD-style Unix)
Windows 98 runs MS-DOS programs
If ABI needs instructions not supported by native
hardware, OS can provide in software
Supporting Multiple OSs on
Same Hardware
Can virtualize the environment that an operating
system sees, an OS-level VM
Hypervisor layer implements sharing of real hardware
resources by multiple OS VMs that each think they
have a complete copy of the machine
Popular in early days to allow mainframe to be shared by multiple
groups developing OS code
Used in modern mainframes to allow multiple versions of OS to be
running simultaneously: OS upgrades with no downtime!
Example for PCs: VMware allows Windows OS to run on top of Linux
(or vice-versa)
Requires trap on access to privileged state that OS
needs
easier if OS interface to hardware well defined
ISA Implementations Partly in
Software
Often good idea to implement part of ISA in software:
Expensive but rarely used instructions can cause trap
to OS emulation routine:
e.g., decimal arithmetic instructions in MicroVAX
implementation of VAX ISA
Infrequent but difficult operand values can cause trap
e.g., IEEE floating-point denormals cause traps in almost all
floating-point unit implementations
Old machine can trap unused opcodes, allows
binaries for new ISA to run on old hardware
e.g., Sun SPARC v8 added integer multiply instructions, older
v7 CPUs trap and emulate
Supporting Non-Native ISAs
Run programs for one ISA on hardware with different ISA
Techniques:
Emulation
OS software interprets instructions at run-time
E.g., OS for PowerPC Macs had emulator for 68000 code
Binary Translation
convert at install and/or load time
IBM AS/400 to modified PowerPC cores
DEC tools for VAX->MIPS->Alpha
Dynamic Translation (or Dynamic Compilation)
compile non-native ISA to native ISA at run time
Sun's HotSpot Java JIT (just-in-time) compiler
Transmeta Crusoe, x86->VLIW code morphing
Run-time Hardware Emulation
Hardware supports two ISAs!
IBM 360 had IBM 1401 emulator in microcode
Intel Itanium converts x86 to native VLIW (two software-visible ISAs)
Emulation
Software instruction set interpreter fetches and
decodes one instruction at a time in emulated VM
[Diagram: the guest executable on disk (guest ISA code and data) is
loaded into emulator memory; the memory image of the guest VM lives
in the host emulator's data memory, alongside the emulator code,
emulator data, the emulator stack, and the guest stack.]
fetch-decode loop:
    while (!stop) {
        inst = Code[PC];   // fetch guest instruction
        PC += 4;           // advance guest PC
        execute(inst);     // decode and execute in software
    }
Emulation
Easy to code, small code footprint
Slow, approximately 100x slower than native
execution for RISC ISA hosted on RISC ISA
Problem is time taken to decode instructions
fetch instruction from memory
switch tables to decode opcodes
extract register specifiers using bit shifts
access register file data structure
execute operation
return to main fetch loop
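A C sketch makes the per-instruction cost concrete (a hypothetical 32-bit RISC guest; every step below is a memory access, shift, or branch):

    #include <stdint.h>

    uint32_t regs[32], pc;
    uint8_t *guest_mem;   /* guest memory image in host memory */

    void step(void) {
        uint32_t inst = *(uint32_t *)(guest_mem + pc);  /* fetch */
        pc += 4;
        uint32_t op = inst >> 26;                       /* decode opcode */
        uint32_t rd = (inst >> 21) & 31;                /* extract regs */
        uint32_t rs = (inst >> 16) & 31;
        uint32_t rt = (inst >> 11) & 31;
        switch (op) {                                   /* dispatch */
        case 0: regs[rd] = regs[rs] + regs[rt]; break;  /* ADD */
        case 1: regs[rd] = regs[rs] - regs[rt]; break;  /* SUB */
        /* ... dozens more cases ... */
        }                                               /* back to loop */
    }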
Binary Translation
Each guest ISA instruction translates into some
set of host (or native) ISA instructions
Instead of dynamically fetching and decoding
instructions at run-time, translate entire binary
program and save result as new native ISA
executable
Removes interpretive fetch-decode overhead
Can do compiler optimizations on translated code
to improve performance
register allocation for values flowing between guest ISA
instructions
native instruction scheduling to improve performance
remove unreachable code
inline assembly procedures
Binary Translation, Take 1
[Diagram: the guest executable on disk (guest ISA data and code) is
translated into a new executable: the data is unchanged, the guest
code is translated to native ISA code, and the native translation
may need extra data workspace.]
Binary Translation Problems
Branch and Jump targets
guest code:
        j L1
        ...
    L1: lw r1, (r4)
        jr (r1)
native code:
    j translation    # native jump at end of block jumps to the
                     # native translation of the lw
    lw translation
    jr translation   # where should the jump register go?
PC Mapping Table
Table gives translated PC for each hosted PC
Indirect jumps translated into code that looks in
table to find where to jump to
can optimize well-behaved guest code for subroutine
call/return by using native PC in return links
If can branch to any guest PC, then need one table entry for every
instruction in the hosted program => big table
If can branch to any PC, then either
limit inter-instruction optimizations
large code explosion to hold optimizations for each possible
entry into sequential code sequence
Only minority of guest instructions are indirect
jump targets, want to find these
highly structured VM design
run-time feedback of where targets were
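A C sketch of the lookup on an indirect jump (a direct-mapped table; sizes and hashing are illustrative assumptions):

    #include <stdint.h>

    #define MAP_SIZE 4096
    typedef struct { uint32_t guest_pc; void *native_pc; } MapEntry;
    static MapEntry pc_map[MAP_SIZE];

    void *translate_target(uint32_t guest_pc) {
        MapEntry *e = &pc_map[(guest_pc >> 2) % MAP_SIZE];
        if (e->guest_pc == guest_pc && e->native_pc)
            return e->native_pc; /* jump into existing translation */
        return 0;                /* miss: fall back to translator/emulator */
    }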
Binary Translation Problems
Self-modifying code!
sw r1, (r2)    # r2 points into code space
Rare in most code, but has to be handled if allowed
by guest ISA
Usually handled by including interpreter and
marking modified code pages as interpret only
Have to invalidate all native branches into modified
code pages
Binary Translation, Take 2
[Diagram: the guest executable on disk (guest ISA code and data) is
translated to native ISA code, while a copy of the guest code and
data is kept in the native data segment alongside a PC mapping table
and a native emulator.]
Mapping table used for indirect jumps and to jump from the emulator
back into native translations
Translations have to check for modified code pages, then jump to the
emulator
Emulator used for run-time modified code; it checks for jumps back
into native code using the PC mapping table
IBM System/38 and AS/400
System/38 announced 1978, AS/400 is follow-on line
High-level instruction set interface designed for binary
translation
Memory-memory style instruction set, never directly
executed by hardware
[Layer diagram, top to bottom: user applications; languages,
database, utilities; Control Program Facility; high-level
architecture interface; vertical microcode (a 48-bit CISC engine in
earlier machines, replaced by modified PowerPC cores in newer AS/400
machines); horizontal microcode; hardware machine.]
Dynamic Translation
Translate code sequences as needed at run-time,
but cache results
Can optimize code sequences based on dynamic
information (e.g., branch targets encountered)
Tradeoff between optimizer run-time and time
saved by optimizations in translated code
Technique used in Java JIT (Just-In-Time)
compilers
Also, Transmeta Crusoe for x86 emulation
Transmeta Crusoe
Converts x86 ISA into internal native VLIW format
using software at run-time ("Code Morphing")
Optimizes across x86 instruction boundaries to
improve performance
Translations cached to avoid translator overhead
on repeated execution
Completely invisible to the operating system: looks
like an x86 hardware processor
Transmeta VLIW Engine
Two VLIW formats, 64-bit and 128-bit, containing 2
or 4 RISC-like operations
VLIW engine optimized for x86 code emulation
evaluates condition codes the same way as x86
has 80-bit floating-point unit
partial register writes (update 8 bits in 32 bit register)
Support for fast instruction writes
run-time code generation important
Two different VLIW implementations, low-end
TM3120, high-end TM5400
native ISA differences invisible to user, hidden by translation
system
Crusoe System
[System diagram: the Crusoe CPU (a VLIW processor with instruction
and data caches) boots from a compressed Code Morphing compiler held
in boot flash ROM. A portion of system DRAM is used by the Code
Morph software (compiler code, translation cache, workspace) and is
invisible to the x86 machine; the rest is x86 DRAM, plus an x86 BIOS
flash.]
Transmeta Translation
x86 code:
    addl %eax, (%esp)       # load data from stack, add to eax
    addl %ebx, (%esp)       # load data from stack, add to ebx
    movl %esi, (%ebp)       # load esi from memory
    subl %ecx, 5            # sub 5 from ecx
first step, translate into RISC ops:
    ld %r30, [%esp]         # load from stack into temp
    add.c %eax, %eax, %r30  # add to %eax, set cond. codes
    ld %r31, [%esp]
    add.c %ebx, %ebx, %r31
    ld %esi, [%ebp]
    sub.c %ecx, %ecx, 5
Compiler Optimizations
RISC ops:
    ld %r30, [%esp]         # load from stack into temp
    add.c %eax, %eax, %r30  # add to %eax, set cond. codes
    ld %r31, [%esp]
    add.c %ebx, %ebx, %r31
    ld %esi, [%ebp]
    sub.c %ecx, %ecx, 5
Optimize:
    ld %r30, [%esp]         # load from stack only once
    add %eax, %eax, %r30
    add %ebx, %ebx, %r30    # reuse data loaded earlier
    ld %esi, [%ebp]
    sub.c %ecx, %ecx, 5     # only this cond. code needed
Scheduling
Optimized RISC ops:
    ld %r30, [%esp]         # load from stack only once
    add %eax, %eax, %r30
    add %ebx, %ebx, %r30    # reuse data loaded earlier
    ld %esi, [%ebp]
    sub.c %ecx, %ecx, 5     # only this cond. code needed
Schedule into VLIW code:
    ld %r30, [%esp];  sub.c %ecx, %ecx, 5
    ld %esi, [%ebp];  add %eax, %eax, %r30;  add %ebx, %ebx, %r30
Translation Overhead
Highly optimizing compiler takes considerable
time to run, adds run-time overhead
Only worth doing for frequently executed code
Translation adds instrumentation into translations
that counts how often code executed, and which
way branches usually go
As count for a block increases, higher
optimization levels are invoked on that code
Exceptions
Original x86 code:
    addl %eax, (%esp)       # load data from stack, add to eax
    addl %ebx, (%esp)       # load data from stack, add to ebx
    movl %esi, (%ebp)       # load esi from memory
    subl %ecx, 5            # sub 5 from ecx
Scheduled VLIW code:
    ld %r30, [%esp];  sub.c %ecx, %ecx, 5
    ld %esi, [%ebp];  add %eax, %eax, %r30;  add %ebx, %ebx, %r30
x86 instructions executed out-of-order with respect to original
program flow
Need to restore state for precise traps
Shadow Registers and Store
Buffer
All registers have working copy and shadow copy
Stores held in software controlled store buffer,
loads can snoop
At end of translation block, commit changes by
copying values from working regs to shadow
regs, and by releasing stores in store buffer
On exception, re-execute x86 code using
interpreter
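A C sketch of the commit/rollback discipline described above (all names and sizes are illustrative):

    #include <stdint.h>
    #include <string.h>

    #define NREGS 64
    uint64_t work[NREGS], shadow[NREGS];

    typedef struct { uint64_t addr, data; } PendingStore;
    PendingStore store_buf[32];
    int n_stores;

    void commit_block(void) {              /* end of translation block */
        memcpy(shadow, work, sizeof work); /* working -> shadow regs */
        for (int i = 0; i < n_stores; i++) /* release buffered stores */
            *(uint64_t *)(uintptr_t)store_buf[i].addr = store_buf[i].data;
        n_stores = 0;
    }

    void rollback_block(void) {            /* exception inside block */
        memcpy(work, shadow, sizeof work); /* discard speculative state */
        n_stores = 0;                      /* drop stores; re-run x86 code
                                              in the interpreter */
    }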
Handling Self-Modifying Code
When a translation is made, mark the associated
x86 code page as being translated in page table
Store to translated code page causes trap, and
associated translations are invalidated
Multithreaded Processors
Pipeline Hazards
LW r1, 0(r2)
LW r5, 12(r1)
ADDI r5, r5, #12
SW 12(r1), r5
Each instruction may depend on the previous one
Without bypassing, need interlocks
[Pipeline diagram: without bypassing, each dependent instruction
stalls in decode until its producer writes back; e.g., the first LW
flows F D X M W, while the dependent LW sits in D for several cycles
before its own X M W, and so on down the chain.]
Bypassing cannot completely eliminate interlocks
or delay slots
Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
One way is to interleave execution of instructions
from different program threads on same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
[Pipeline diagram: threads T1-T4 issue in rotation (T1: LW r1, 0(r2);
T2: ADD r7, r1, r4; T3: XORI r5, r4, #12; T4: SW 0(r7), r5; then T1:
LW r5, 12(r1)), one instruction per cycle. The last instruction in a
thread always completes writeback before the next instruction in the
same thread reads the regfile.]
CDC 6600 Peripheral Processors
(Cray, 1965)
First multithreaded hardware
10 virtual I/O processors
fixed interleave on simple pipeline
pipeline has 100ns cycle time
each processor executes one instruction every 1000ns
accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
[Diagram: per-thread PCs, selected by a thread-select counter, feed
the I$ and IR; per-thread register files (GPR1, ...) are selected at
register read and writeback; the selected PC is incremented (+1); the
thread-select value accompanies the instruction.]
Have to carry thread select down pipeline to ensure
correct state bits read/written at each pipe stage
Multithreading Costs
Appears to software (including OS) as multiple
slower CPUs
Each thread requires its own user state
GPRs
PC
Also, needs own OS control state
virtual memory page table base register
exception handling registers
Other costs?
Thread Scheduling Policies
Fixed interleave (CDC 6600 PPUs, 1965)
each of N threads executes one instruction every N
cycles
if thread not ready to go in its slot, insert pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971)
OS allocates S pipeline slots amongst N threads
hardware performs fixed interleave over S slots,
executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982)
hardware keeps track of which threads are ready to go
picks next thread to execute based on hardware priority
scheme
Software-controlled interleave with S > N: a wheel with a thread
assigned to each slot. If the thread in the current slot is ready,
it goes; else a NOP, i.e., a pipeline bubble, is inserted.
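A C sketch of fixed-interleave selection with bubble insertion (illustrative):

    #define NTHREADS 4
    int ready[NTHREADS]; /* set when a thread can issue */
    int slot;

    int select_thread(void) {         /* returns thread id, or -1 = bubble */
        int t = slot;
        slot = (slot + 1) % NTHREADS; /* fixed round-robin wheel */
        return ready[t] ? t : -1;     /* not ready => pipeline bubble */
    }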
What Grain Multithreading?
So far assumed fine-grained multithreading
CPU switches every cycle to a different thread
When does this make sense?
Coarse-grained multithreading
CPU switches every few cycles to a different
thread
When does this make sense?
Multithreading Design Choices
[Diagram: a CPU with register file (RF), L1 instruction and data
caches, a unified L2 cache, and memory modules.]
Context switch to another thread every cycle,
or on hazard or L1 miss or L2 miss or network
request
Per-thread state and context-switch overhead
Interactions between threads in memory
hierarchy
Managing interactions between threads.
Denelcor HEP
(Burton Smith, 1982)
First commercial machine to use hardware threading in main
CPU
120 threads per processor
10 MHz clock rate
Up to 8 processors
precursor to Tera MTA (Multithreaded Architecture)
Tera MTA Overview
Up to 256 processors
Up to 128 active threads per processor
Processors and memory modules populate a sparse
3D torus interconnection fabric
Flat, shared main memory
No data cache
Sustains one main memory access per cycle per processor
50W/processor @ 260MHz
SGI bought Cray, and Tera was a spin-off. 1997: integer sort press
release.
MTA Instruction Format
Three operations packed into 64-bit instruction word
(short VLIW)
One memory operation, one arithmetic operation,
plus one arithmetic or branch operation
Memory operations incur ~150 cycles of latency
Explicit 3-bit lookahead field in instruction gives
number of subsequent instructions (0-7) that are
independent of this one
cf. instruction grouping in VLIW
allows fewer threads to fill machine pipeline
used for variable-sized branch delay slots
Thread creation and termination instructions
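As a rough worked example (the arithmetic is mine, not from the slides): a lookahead value of 7 lets up to 8 consecutive instructions from one thread overlap in the pipeline, so as few as three such threads (21/8, rounded up) could fill the 21-cycle pipeline, versus 21 threads when the lookahead is 0.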
MTA Multithreading
Each processor supports 128 active hardware threads
128 SSWs, 1024 target registers, 4096 general-purpose
registers
Every cycle, one instruction from one active thread is
launched into pipeline
Instruction pipeline is 21 cycles long
At best, a single thread can issue one instruction
every 21 cycles
Clock rate is 260MHz, effective single thread issue rate is
260/21 = 12.4MHz
Per-thread register state:
32 64-bit general-purpose registers (R0-R31)
unified integer/floating-point register set
R0 hard-wired to zero
8 64-bit branch target registers (T0-T7)
load branch target address before branch instruction
T0 contains address of user exception handler
1 64-bit stream status word (SSW)
includes 32-bit program counter
four condition code registers
floating-point rounding mode
MTA Pipeline
[Pipeline diagram: instruction fetch feeds an issue pool; operations
flow through the execution pipeline, the memory pool, and a retry
pool connected to the interconnection network, retiring through a
write pool. If the memory unit is busy or a sync operation fails, the
operation retries: it just goes around the memory pipeline again.]
Coarse-Grain Multithreading
Tera MTA designed for supercomputing
applications with large data sets and low locality
No data cache
Many parallel threads needed to hide large memory
latency
Other applications are more cache friendly
Few pipeline bubbles when the cache is hitting
Just add a few threads to hide occasional cache
miss latencies
Swap threads on cache misses
Tera not very successful, 2 machines sold.
Changed their name back to Cray!
MIT Alewife
Modified SPARC chips
register windows hold
different thread contexts
Up to four threads per node
Thread switch on local cache
miss
IBM PowerPC RS64-III (Pulsar)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On L2 cache miss, pipeline is flushed and
execution switches to second thread
short pipeline minimizes flush penalty (4 cycles), small
compared to memory access latency
flush pipeline to simplify exception handling
Speculative, Out-of-Order Superscalar
Processor
[Diagram: an in-order front end (PC, fetch, decode & rename) feeds a
reorder buffer; execution is out-of-order across the branch unit,
ALU, and memory unit with store buffer and D$, reading a physical
register file; commit is in-order.]
Superscalar Machine Efficiency
[Diagram: instruction issue slots over time, issue width 4. A
completely idle cycle is vertical waste; a partially filled cycle,
i.e., IPC < 4, is horizontal waste.]
Why horizontal waste?
Why vertical waste?
Vertical Multithreading
[Diagram: a second thread interleaved cycle-by-cycle fills the
previously idle cycles; partially filled cycles, i.e., IPC < 4
(horizontal waste), remain.]
Cycle-by-cycle interleaving of second thread
removes vertical waste
Ideal Multithreading for Superscalar
[Diagram: issue slots over time, with instructions from multiple
threads packed into every slot.]
Interleave multiple threads to multiple issue slots
with no restrictions
Simultaneous Multithreading
Add multiple contexts and fetch engines to
wide out-of-order superscalar processor
[Tullsen, Eggers, Levy, UW, 1995]
OOO instruction window already has most of
the circuitry required to schedule from
multiple threads
Any single thread can utilize whole machine
Comparison of Issue Capabilities
Courtesy of Susan Eggers; Used with Permission
Illustrates SMT thread issue & execution and how it differs:
SS: only a single thread; long-latency instructions with lots of
dependent instructions
FGMT: limited by the amount of ILP in each thread, just as on the SS
MP: each processor issues instructions from its own thread
Example of one thread stalling or having little ILP
From Superscalar to SMT
SMT is an out-of-order superscalar extended with
hardware to support multiple executing threads
[Diagram: a fetch unit feeds per-thread instruction queues (IQ);
renaming registers map onto shared functional units; an ICOUNT
policy selects among threads.]
no special HW needed for scheduling instructions from different
threads onto FUs; can use the same OOO mechanism as a superscalar
for instruction issue
renaming HW eliminates false dependences both within a thread (just
like a conventional SS) and between threads: map the thread-specific
architectural registers of all threads onto a pool of physical
registers
instructions are issued when operands are available, without regard
to thread (the scheduler does not look at thread IDs), and are
thereafter called by their physical name
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared
register files
[Diagram as above, with the extra register-file access stages shown.]
8*32 for the architecture state + 96 additional registers for register
renaming
From Superscalar to SMT
Fetch from the two highest throughput threads.
Why?
[Diagram as above, highlighting the ICOUNT fetch policy.]
Fetch unit that can keep up with the simultaneous multithreaded
execution engine: fetch from the threads that have the fewest
instructions waiting to be executed, i.e., those making the best
progress through the machine
40% increase in IPC over round-robin
From Superscalar to SMT
Small items
per-thread program counters
per-thread return stacks
per-thread bookkeeping for instruction retirement,
trap & instruction dispatch queue flush
thread identifiers, e.g., with BTB & TLB entries
none of the small stuff endangers the critical path
most mechanisms already exist; they are duplicated for each thread
or implemented to apply to only one thread at a time
thread IDs are carried for retirement, trap, and queue flush, but
not used for scheduling
a HW structure points to all instructions for each thread; need a
flush mechanism for branch misprediction
A fairly straightforward extension to the OOO SS; this plus the
n-fold performance boost was responsible for the technology transfer
to chip manufacturers
Simultaneous Multithreaded Processor
[Diagram: per-thread PCs and per-thread rename tables feed fetch and
decode & rename; a shared reorder buffer with per-thread commit; a
shared physical register file; execution across the branch unit,
ALU, and memory unit with store buffer and D$.]
The OOO execution unit does not see thread identifiers, only
physical register specifiers
SMT Design Issues
Which thread to fetch from next?
Don't want to clog the instruction window with a thread that has
many stalls => try to fetch from the thread that has the fewest
instructions in the window
Locks
A virtual CPU spinning on a lock executes many instructions but
gets nowhere => add ISA support to lower the priority of a thread
spinning on a lock
Intel Pentium-4 Xeon Processor
Hyperthreading == SMT
Dual physical processors, each 2-way SMT
Logical processors share nearly all resources
of the physical processor
Caches, execution units, branch predictors
Die area overhead of hyperthreading ~ 5%
When one logical processor is stalled, the
other can make progress
No logical processor can use all entries in queues when two threads
are active, which allows a processor running only one active
software thread to run at the same speed with or without
hyperthreading
The load-store buffer in the L1 cache doesn't behave like that,
hence a 15% slowdown.
Parallel Processors
Parallel Processing: The Holy Grail
Use multiple processors to improve runtime of a
single task
technology limits speed of uniprocessor
economic advantages to using replicated processing units
Preferably programmed using a portable high-level
language
When are two heads better than one?
Jigsaw puzzle analogy
Motivating Applications
Weather forecasting
Climate modeling
Material science
Drug design
Computational genomics
And many more
Number-crunching apps. The claim is that weather forecasts are
accurate out to 5 days, and we would like to improve on that.
Nonlinear phenomena.
Flynn's Classification (1966)
Broad classification of parallel computing systems
based on number of instruction and data streams
SISD: Single Instruction, Single Data
conventional uniprocessor
SIMD: Single Instruction, Multiple Data
distributed memory SIMD (MPP, DAP, CM-1&2, Maspar)
shared memory SIMD (STARAN, vector computers)
MIMD: Multiple Instruction, Multiple Data
message passing machines (Transputers, nCube, CM-5)
non-cache-coherent SMPs (BBN Butterfly, T3D)
cache-coherent SMPs (Sequent, Sun Starfire, SGI Origin)
MISD: Multiple Instruction, Single Data
no commercial examples
MISD example: to check that a number is prime, processor number i
does a divide on the same number.
Potential of the four classes
SISD: one operation (+) on one operand pair (A, B), producing A+B
MISD: several operations (+, *) on one operand pair, producing A+B
and A*B
SIMD: one operation (+) on several operand pairs (A,B and C,D),
producing A+B and C+D
MIMD: several operations (+, *) on several operand pairs, producing
A+B and C*D
SIMD Architecture
Central controller broadcasts instructions to
multiple processing elements (PEs)
[Diagram: an array controller broadcasts control to eight PEs, each
with its own memory; data moves between PEs over the inter-PE
connection network.]
Only requires one controller for whole array
Only requires storage for one copy of program
All computations fully synchronized
SIMD Machines
Illiac IV (1972)
64 64-bit PEs, 16KB/PE, 2D network
ICL DAP (Distributed Array Processor) (1980)
4K bit-serial PEs, 512B/PE, 2D network
Thinking Machines Connection Machine CM-1 (1985)
64K bit-serial PEs, 512B/PE, 2D + hypercube router
CM-2: 2048B/PE, plus 2,048 32-bit floating-point units
Maspar MP-1 (1989)
16K 4-bit processors, 16-64KB/PE, 2D-mesh + Xbar
MP-2: 16K 32-bit processors, 64KB/PE
Also shared memory SIMD vector supercomputers
TI ASC ('71), CDC Star-100 ('73), Cray-1 ('76)
Illiac IV: $31M instead of the $8M paid by DARPA ("Illiaction").
The Glypnir language is named for the Norse mythological rope that
controls a vicious animal.
SIMD Today
Distributed memory SIMD failed as large-scale general-purpose
computer platform. Why?
Required huge quantities of data parallelism (> 10,000 elements)
Required programmer-controlled distributed data layout
Vector supercomputers (shared memory SIMD) still successful in
high-end supercomputing
Reasonable efficiency on short vector lengths (10-100)
Single memory space
Distributed memory SIMD popular for special-purpose accelerators,
e.g., image and graphics processing
Renewed interest for Processor-in-Memory (PIM)
Memory bottlenecks => put simple logic close to memory
Viewed as enhanced memory for conventional system
Needs merged DRAM + logic fabrication process
Commercial examples, e.g., graphics in Sony Playstation-2
MIMD Machines
Message passing, distributed memory
Thinking Machines CM-5
Intel Paragon
Meiko CS-2
many cluster systems (e.g., IBM SP-2, Linux Beowulfs)
Shared memory
no hardware cache coherence
IBM RP3
BBN Butterfly
Cray T3D/T3E
Parallel vector supercomputers (Cray T90, NEC SX-5)
hardware cache coherence
many small-scale SMPs (e.g. Quad Pentium Xeon systems)
large scale bus/crossbar-based SMPs (Sun Starfire)
large scale directory-based SMPs (SGI Origin)
Thinking Machines CM-5: the interconnect network is a fat tree.
RP3: Research Parallel Processor Prototype.
BBN: Bolt Beranek and Newman. "A spiffy interconnect will not by
itself allow the construction of a supercomputer out of common
household appliances." The problem was that commodity processors did
not support multiple outstanding memory requests, so remote memory
accesses killed performance unless excellent data layout was
achieved; it lost to message-passing networks.
Summing Numbers
On a sequential computer:
    Sum = a[0]
    for (i = 1; i < m; i++)
        Sum = Sum + a[i]
Theta(m) complexity
Have N processors adding up m/N numbers
Shared memory:
    Global-sum = 0
    for each processor {
        local-sum = 0
        calculate local-sum of m/N numbers
        Lock
        Global-sum = Global-sum + local-sum
        Unlock
    }
Complexity?
Theta(m/N) + Theta(N); the Theta(N) comes from adding the N partial
sums into the global sum in series.
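A runnable C sketch of the shared-memory version using POSIX threads (names and sizes are illustrative):

    #include <pthread.h>

    #define N 4
    #define M 1000000
    static double a[M];
    static double global_sum = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long)arg, chunk = M / N;
        double local = 0;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            local += a[i];             /* Theta(m/N) local work */
        pthread_mutex_lock(&lock);     /* serialized: Theta(N) total */
        global_sum += local;
        pthread_mutex_unlock(&lock);
        return 0;
    }

    int main(void) {
        pthread_t t[N];
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], 0, worker, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], 0);
        return 0;
    }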
Summing Numbers Distributed Memory
Assume a square mesh of processors
Each processor computes the local sum of its
m/N numbers
Each processor passes its local sum to
another processor in a coordinated way
The global sum is finally in processor P11
[Diagram: a square mesh of processors (P11, P12, P13, ...); partial
sums are passed along each row and then up the first column toward
P11.]
Complexity
[Mesh diagram as above.]
How many additions?
How many communications?
Total time complexity?
sqrt(N) - 1 additions and sqrt(N) - 1 communications in each
direction; total Theta(m/N) + Theta(sqrt(N)) + C, where C is the
cost of communication
Message Passing MPPs
(Massively Parallel Processors)
Initial Research Projects
Caltech Cosmic Cube (early 1980s) using custom Mosaic
processors
Commercial Microprocessors including MPP Support
Transputer (1985)
nCube-1(1986) /nCube-2 (1990)
Standard Microprocessors + Network Interfaces
Intel Paragon (i860)
TMC CM-5 (SPARC)
IBM SP-2 (RS/6000)
[Diagram: nodes, each a memory plus network interface (NI), attached
to an interconnection network; designs scale to 100s-10,000s of
nodes.]
MPP Vector Supers: Fujitsu VPP series
Read up on the Cosmic Cube, Transputer, CM-5, and other machines.
Message Passing MPP Problems
All data layout must be handled by software
cannot retrieve remote data except with message
request/reply
Message passing has high software overhead
early machines had to invoke OS on each message
(100 μs - 1 ms/message)
even user level access to network interface has
dozens of cycles overhead (NI might be on I/O bus)
Cost of sending messages?
Cost of receiving messages?
Sending can be cheap (like stores)
Receiving is expensive, need to poll or interrupt.
Shared Memory Machines
Two main categories with respect to caches
non cache coherent
hardware cache coherent
Will work with any data placement (but might be slow)
can choose to optimize only critical portions of code
Load and store instructions used to communicate
data between processes
no OS involvement
low software overhead
Usually some special synchronization primitives
fetch&op
load linked/store conditional
In large scale systems, logically shared memory is
implemented as physically distributed modules
Shared Memory Machines
Exclusive Read, Exclusive Write (EREW)
Concurrent Read, Exclusive Write (CREW)
Exclusive Read, Concurrent Write (ERCW)
Concurrent Read, Concurrent Write (CRCW)
Need to deterministically specify the contents
of a memory location after concurrent write
Assign priorities to processors and store value
from processor with highest priority
Allow concurrent writes if the values being
written are the same
Max, min, average, or sum of values is stored
(numeric applications)
Searching
N processors search a list S = {L1, L2, ..., Lm}
for the index of an item x
Assume x can appear many times and any
index will do
Step 1: for i = 1 to N in parallel
read x
Step 2: Sequentially search through sub-list
Si with m/N numbers
Return Ki = -1 if x not in Si, else index
Step 3: for i = 1 to N in parallel
if (Ki > 0) k = Ki
Complexity
What is the complexity for the different shared
memory schemes?
        Step 1   Step 2   Step 3   Total
EREW:   O(N)     O(m/N)   O(N)     O(N) + O(m/N)
CREW:
ERCW:
CRCW:
Cray T3E
Up to 2,048 675MHz Alpha 21164
processors connected in 3D torus
Each node has 256MB-2GB local DRAM memory
Load and stores access global memory over network
Only local memory cached by on-chip caches
Alpha microprocessor surrounded by custom shell circuitry to
make it into effective MPP node. Shell provides:
external copy of on-chip cache tags to check against remote
writes to local memory, generates on-chip invalidates on match
address management to allow all of external physical memory
to be addressed
atomic memory operations (fetch&op)
support for hardware barriers/eureka to synchronize parallel
tasks
Read up on the Cray T3E's lack of hardware cache coherence; what is
a eureka?
Bus-Based Cache-Coherent SMPs
[Diagram: processors, each with a cache ($), on a shared bus to a
central memory.]
Small scale (<= 4 processors) bus-based SMPs by far the
most common parallel processing platform today
Bus provides broadcast and serialization point for simple
snooping cache coherence protocol
Modern microprocessors integrate support for this protocol
Sun Starfire (UE10000)
Up to 64-way SMP using bus-based snooping protocol
[Diagram: each system board holds 4 processors plus a memory module;
board interconnects attach to 4 interleaved address busses, used to
scale the snooping protocol, while data transfers go separately over
a high-bandwidth 16x16 data crossbar.]
Interleaved multiple address busses: the hardest part of scaling up
an SMP system is its coherence bandwidth. Read the Starfire paper.
SGI Origin 2000
Large scale distributed directory SMP
Scales from 2 processor workstation
to 512 processor supercomputer
Node contains:
Two MIPS R10000 processors plus caches
Memory module including directory
Connection to global network
Connection to I/O
Scalable hypercube switching network
supports up to 64 two-processor nodes (128
processors total)
(Some installations up to 512 processors)
Read up on directories and compare them to snoopy protocols; see the
SGI Origin paper.
Diseconomies of Scale
Few customers require the largest machines
much smaller volumes sold
have to amortize development costs over smaller number of
machines
Different hardware required to support largest
machines
dedicated interprocessor networks for message passing MPPs
T3E shell circuitry
large backplane for Starfire
directory storage and routers in SGI Origin
Large machines cost more per processor than small
machines!
Trends in High-End Server CPUs
[Diagram: a processor chip containing CPUs 1..N, a shared L2, a
coherence engine, a memory controller, and a network interface to
other processing nodes; local off-chip DRAM attaches to the memory
controller.]
DRAM controllers on chip (UltraSPARC-III, Alpha 21364)
reduce main memory latency
increase memory bandwidth
On-chip network routers (Alpha 21364, Power-4)
Cheaper/faster connectivity for cache coherence traffic
Multiple processors on one chip
Chip multiprocessor systems (IBM Power-4)
Simultaneous multithreading (Pentium-4 Xeon)
Integration in VLSI: DRAM controllers on chip reduce main memory
latency (250 ns to 170 ns in UltraSPARC-III) and increase memory
bandwidth (12 GB/s over 8 Rambus channels in the 21364).
Clusters and Networks of Workstations
Connect multiple complete machines together using
standard fast interconnects
Little or no hardware development cost
Each node can boot separately and operate
independently
Interconnect can be attached at I/O bus (most common)
or on memory bus (higher speed but more difficult)
Clustering initially used to provide fault tolerance
Clusters of SMPs (CluMPs)
Connect multiple n-way SMPs using a cache-coherent memory bus, fast message passing
network or non cache-coherent interconnect
Build message passing MPP by connecting
multiple workstations using fast interconnect
connected to I/O Bus. Main advantage?
HP Exemplar (cache coherent), SGI Power Challenge Array, NEC SX-5
vector supercomputers.
UC Berkeley NOW: 100 Sun workstations connected by a Myrinet
network. What is Myrinet?
Sold commercially by IBM (SP-2), and Cray sells networks of Alphas.
The Future?