Fast Computer Arithmetic Guide

This document discusses computer arithmetic and integer addition. It reviews binary integers and two's complement representation. It then explains how integer addition works through half adders, full adders, and ripple-carry adders. A ripple-carry adder chains together full adders but has a delay linear in the number of bits, making it too slow for high-performance computers. Faster adder designs are needed.

COMPUTER

ARCHITECTURE
COMPUTER ARITHMETIC
Dr. Valon Raca
Notes
- Slides adapted and based on sources from University of Pennsylvania (Joseph
Devietti, Benedict Brown, C.J. Taylor, Milo Martin & Amir Roth)

2
This Unit: Arithmetic

⬗ A little review
⬗ Binary + 2s complement
⬗ Ripple-carry addition (RCA)

⬗ Fast integer addition
⬗ Carry-select (CSeA)
⬗ Carry-lookahead adder (CLA)
⬗ Shifters
⬗ Integer multiplication and division
⬗ Floating point arithmetic
3
Readings

⬗ P&H
⬗ Chapter 3
⬗ Appendix B.6

4
The Importance of Fast Arithmetic

[Figure: single-cycle datapath: PC → insn memory → register file (s1, s2, d) → ALU → data memory; critical path T_insn-mem + T_regfile + T_ALU + T_data-mem + T_regfile]


⬗ Addition of two numbers is the most common operation
⬗ Programs use addition frequently
⬗ Loads and stores use addition for address calculation
⬗ Branches use addition to test conditions and calculate targets
⬗ All insns use addition to calculate the default next PC
⬗ Fast addition is critical to high performance
5
Review: Binary Integers
⬗ Computers represent integers in binary (base 2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
+ Natural since only two values are represented
⬗ Addition, etc. take place as usual (carry the 1, etc.)
17 = 10001
+5 = 101
22 = 10110

⬗ Some old machines use decimal (base 10), with each digit coded in binary


30 = 0011 0000
• Often called BCD (binary-coded decimal)
– Unnatural for digital logic, implementation complicated & slow
+ More precise for currency operations
6
Fixed Width
⬗ On pencil and paper, integers have infinite width

⬗ In hardware, integers have fixed width


⬗ N bits: 16, 32 or 64
⬗ LSB has weight 2^0, MSB has weight 2^(N–1)
⬗ Unsigned range: 0 to 2^N–1

⬗ Numbers ≥ 2^N represented using multiple fixed-width integers


⬗ In software (e.g., Java BigInteger class)
7
What About Negative Integers?
⬗ Option I: sign/magnitude
⬗ Unsigned magnitude plus one bit for sign
10 = 00001010, -10 = 10001010
+ Matches our intuition from “by hand” decimal arithmetic
– Two representations of zero (+0 and –0)
– Addition is difficult
⬗ Symmetric range: –(2^(N–1)–1) to 2^(N–1)–1

⬗ Option II: two’s complement (2C)


⬗ Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
+ One representation for 0 (all zeroes)
+ Easy addition
⬗ Asymmetric range: –2^(N–1) to 2^(N–1)–1
8
The Tao of 2C
⬗ How did 2C come about?
⬗ “Let’s design a representation that makes addition easy”
⬗ Think of subtracting 10 from 0 by hand with 8-bit numbers
⬗ Have to “borrow” 1s from some imaginary leading 1
0 = 100000000
–10 = –00001010
      011110110

⬗ Now, add the conventional way…

–10 = 11110110
+10 = 00001010
0 = 100000000 (the carry out of the MSB is discarded)
9
Still More On 2C
⬗ What is the interpretation of 2C?
⬗ Same as binary, except MSB has weight –2^(N–1), not 2^(N–1)
⬗ –10 = 11110110 = –2^7+2^6+2^5+2^4+2^2+2^1
+ Extends to any width
⬗ –10 = 110110 = –2^5+2^4+2^2+2^1
⬗ Why? 2^N = 2*2^(N–1), so –2^(N–1) = –2^N+2^(N–1)
⬗ –2^5+2^4+2^2+2^1 = (–2^6+2*2^5)–2^5+2^4+2^2+2^1 = –2^6+2^5+2^4+2^2+2^1
⬗ Equivalent to computing modulo 2^N

⬗ Trick to negating a number quickly: –B = B’ + 1


⬗ –(1) = (0001)’+1 = 1110+1 = 1111 = –1
⬗ –(–1) = (1111)’+1 = 0000+1 = 0001 = 1
⬗ –(0) = (0000)’+1 = 1111+1 = 0000 = 0
⬗ Think about why this works
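The negation trick can be checked mechanically. A minimal Python sketch (the function name and 4-bit width are my choices, not from the slides):

```python
def neg2c(x, n=4):
    """Two's-complement negation: invert all n bits, then add 1."""
    mask = (1 << n) - 1
    return (~x + 1) & mask

# -(0001) = 1110 + 1 = 1111 (i.e., -1)
assert neg2c(0b0001) == 0b1111
# -(1111) = 0000 + 1 = 0001 (i.e., 1)
assert neg2c(0b1111) == 0b0001
# -(0000) = 1111 + 1 = 0000 (the carry out of the MSB is discarded)
assert neg2c(0b0000) == 0b0000
```

The asymmetric range also shows up here: negating 1000 (–8 in 4 bits) yields 1000 again, since +8 is not representable.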
10
Addition

11
1st Grade: Decimal Addition
1
43
+29
72

⬗ Repeat N times
⬗ Add least significant digits and any overflow from previous add
⬗ Carry “overflow” to next addition
⬗ Overflow: any digit other than least significant of sum
⬗ Shift two addends and sum one digit to the right

⬗ Sum of two N-digit numbers can yield an N+1 digit number

12
Binary Addition: Works the Same Way
1 111111
43 = 00101011
+29 = 00011101
72 = 01001000

⬗ Repeat N times
⬗ Add least significant bits and any overflow from previous add
⬗ Carry the overflow to next addition
⬗ Shift two addends and sum one bit to the right
⬗ Sum of two N-bit numbers can yield an N+1 bit number

– More steps (smaller base)


+ Each one is simpler (adding just 1 and 0)
⬗ So simple we can do it in hardware
13
The Half Adder
⬗ How to add two binary integers in hardware?
⬗ Start with adding two bits
⬗ When all else fails ... look at truth table
A B | CO S
0 0 |  0 0
0 1 |  0 1
1 0 |  0 1
1 1 |  1 0

⬗ S = A ^ B
⬗ CO (carry out) = AB
⬗ This is called a half adder (HA): inputs A, B; outputs S, CO
14
The Other Half
⬗ We could chain half adders together, but to do that…
⬗ Need to incorporate a carry out from previous adder
C A B | CO S
0 0 0 |  0 0
0 0 1 |  0 1
0 1 0 |  0 1
0 1 1 |  1 0
1 0 0 |  0 1
1 0 1 |  1 0
1 1 0 |  1 0
1 1 1 |  1 1

⬗ S = C’A’B + C’AB’ + CA’B’ + CAB = C ^ A ^ B
⬗ CO = C’AB + CA’B + CAB’ + CAB = CA + CB + AB
⬗ This is called a full adder (FA): inputs A, B, CI; outputs S, CO
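The sum and carry-out equations can be exercised against the truth table. A small Python sketch (function name mine):

```python
def full_adder(a, b, c):
    """One-bit full adder: S = C ^ A ^ B, CO = CA + CB + AB."""
    s = a ^ b ^ c
    co = (c & a) | (c & b) | (a & b)
    return s, co

# The pair (CO, S) is just the 2-bit count of 1s among the three inputs
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, co = full_adder(a, b, c)
            assert 2 * co + s == a + b + c
```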

15
Ripple-Carry Adder
⬗ N-bit ripple-carry adder
⬗ N 1-bit full adders “chained” together
⬗ CO0 = CI1, CO1 = CI2, etc.
⬗ CI0 = 0
⬗ CO(N–1) is carry-out of entire adder
● CO(N–1) = 1 → “overflow”

⬗ Example: 16-bit ripple-carry adder
[Figure: full adders for bits 0…15 stacked; each FAi takes Ai, Bi and the previous carry, producing Si; the final carry is CO]
⬗ How fast is this?
⬗ How fast is an N-bit ripple-carry adder?
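Behaviorally, the chain looks like this. A Python sketch (names mine) in which bit i's carry-out becomes bit i+1's carry-in:

```python
def ripple_carry_add(a, b, n=16):
    """N-bit ripple-carry adder: N full adders chained, CI0 = 0."""
    carry, s = 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i                      # Si
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # COi = CIi+1
    return s, carry  # final carry is CO(N-1), the "overflow" bit

assert ripple_carry_add(43, 29) == (72, 0)
assert ripple_carry_add(0xFFFF, 1) == (0, 1)  # wraps, carry-out = 1
```

The delay problem is visible in the loop: nothing about bit 15 can be resolved until the carry has passed through all 15 earlier iterations.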

16
Quantifying Adder Delay
⬗ Combinational logic dominated by gate delays
⬗ How many gates between input and output?
⬗ Array storage dominated by wire delays
⬗ Longest delay or critical path is what matters

⬗ Can implement any combinational function in “2” logic levels


⬗ 1 level of AND + 1 level of OR (PLA)
⬗ NOTs are “free”: push to input (DeMorgan’s) or read from latch
⬗ Example: delay(FullAdder) = 2
⬗ d(CarryOut) = delay(AB + AC + BC)
⬗ d(Sum) = d(A ^ B ^ C) = d(AB’C’ + A’BC’ + A’B’C + ABC) = 2
⬗ Note ‘^’ means Xor (just like in C & Java)

⬗ Caveat: “2” assumes gates have few (<8 ?) inputs


17
Ripple-Carry Adder Delay
⬗ Longest path is to CO15 (or S15)
⬗ delay(CO15) = 2 + MAX(d(A15), d(B15), d(CI15))
⬗ d(A15) = d(B15) = 0, d(CI15) = d(CO14)
⬗ d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) … = 32

⬗ In general: d(CO(N–1)) = 2N
– Too slow!
– Linear in number of bits

⬗ Number of gates is also linear

18
Fast Addition

19
Bad idea: a PLA-based Adder?
⬗ If any function can be expressed as two-level logic…
⬗ …why not use a PLA for an entire 8-bit adder?
⬗ Not small
⬗ Approx. 2^16 AND gates, each with 16 inputs
⬗ Then, 8 OR gates, each with up to 2^16 inputs
⬗ Number of gates exponential in bit width!
⬗ Not that fast, either
⬗ An AND gate with 65K inputs != a 2-input AND gate
⬗ Many-input gates are made from trees of, say, 4-input gates
⬗ 2^16-input gates would have at least 8 logic levels
⬗ So, at least 10 levels of logic for a 16-input PLA
⬗ Even so, delay is still logarithmic in number of bits
⬗ There are better (faster, smaller) ways
20
Theme: Hardware != Software
⬗ Hardware can do things that software fundamentally can’t
⬗ And vice versa (of course)

⬗ In hardware, it’s easier to trade resources for latency

⬗ One example of this: speculation


⬗ Slow computation waiting for some slow input?
⬗ Input one of two things?
⬗ Compute with both (slow), choose right one later (fast)

⬗ Does this make sense in software? Not on a uni-processor


⬗ Difference? hardware is parallel, software is sequential
21
Carry-Select Adder
⬗ Carry-select adder
⬗ Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once assuming C8 = 1
⬗ Choose the correct one when CO7 finally becomes available
+ Effectively cuts carry chain in half (breaks critical path)
– But adds muxes
⬗ Delay?
[Figure: an 8-bit adder computes S7-0 and CO7 from A7-0, B7-0 with carry-in 0; two 8-bit adders compute S15-8 from A15-8, B15-8 with carry-in 0 and carry-in 1; CO7 drives a mux that selects the correct S15-8 and CO]
22
Multi-Segment Carry-Select Adder
⬗ Multiple segments
⬗ Example: 5, 5, 6 bit = 16 bit

⬗ Hardware cost
⬗ Still mostly linear (~2x)
⬗ Compute each segment with 0 and 1 carry-in
⬗ Serial mux chain

⬗ Delay
⬗ 5-bit adder (10) + two muxes (4) = 14
[Figure: one 5-bit adder for bits 4-0; duplicated 5-bit adders (carry-in 0 and 1) for bits 9-5; duplicated 6-bit adders for bits 15-10; each earlier segment’s carry selects the next segment’s sum and carry via a mux]
23
Carry-Select Adder Delay
⬗ What is carry-select adder delay (two segment)?
⬗ d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2
⬗ d(CO15) = MAX(2*8, 2*8) + 2 = 18
⬗ In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA)
⬗ What if we cut adder into 4 equal pieces?
⬗ Would it be 2*(N/4) + 2 = 10? Not quite
⬗ d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2
⬗ d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2
⬗ d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2
⬗ d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2
⬗ d(CO15) = 2*4 + 3*2 = 14
⬗ N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2
⬗ 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18
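The formula is easy to tabulate. A sketch evaluating 2*(N/M) + (M–1)*2 (assumes M divides N evenly, as in the slide's examples):

```python
def csea_delay(n, m):
    """Gate delay of an N-bit carry-select adder in M equal segments:
    ripple through one segment, then M-1 serial muxes at 2 delays each."""
    return 2 * (n // m) + (m - 1) * 2

assert csea_delay(16, 1) == 32   # degenerate case: plain ripple-carry
assert csea_delay(16, 2) == 18
assert csea_delay(16, 4) == 14   # the minimum for N = 16
assert csea_delay(16, 8) == 18
```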
24
16b Carry-Select Adder Delay
[Chart: gate delay of a 16-bit carry-select adder, 2*(16/M) + (M–1)*2, versus number of segments M = 1…16; delay falls from 32 at M = 1 to a minimum of 14 at M = 4, then rises again as the serial mux chain dominates]

25
Another Option: Carry Lookahead
⬗ Is carry-select adder as fast as we can go?
⬗ Nope!

⬗ Another approach to using additional resources


⬗ Instead of redundantly computing sums assuming different
carries
⬗ Use redundancy to compute carries more quickly
⬗ This approach is called carry lookahead (CLA)

26
A & B → Generate & Propagate
⬗ Let’s look at the single-bit carry-out function
⬗ CO = AB+AC+BC
⬗ Factor out terms that use only A and B (available immediately)
⬗ CO = (AB)+(A+B)C
⬗ (AB): generates carry-out regardless of incoming C → rename to G
⬗ (A+B): propagates incoming C → rename to P
⬗ CO = G+PC
27
Infinite Hardware CLA
⬗ Can expand C1…N in terms of G’s, P’s, and C0
⬗ Example: C16
⬗ C16 = G15+P15C15
⬗ C16 = G15+P15(G14+P14C14)
⬗ C16 = G15+P15G14+ … + P15P14…P2P1G0 + P15P14…P2P1P0C0
⬗ Similar expansions for C15, C14, etc.

⬗ How much does this cost?


⬗ CN needs: N ANDs + 1 OR, largest with N+1 inputs
⬗ CN…C1 need: N*(N+1)/2 ANDs + N ORs, max N+1 inputs
⬗ N=16: 152 total gates, max input 17
⬗ N=64: 2144 total gates, max input 65
⬗ And how fast is it really?
– Not that fast, unfortunately, 17-input gates are really slow
28
Compromise: Multi-Level CLA
⬗ Ripple carry
+ Few small gates: 2N gates, max 3 inputs
– Laid in series: 2N (linear) latency

⬗ Infinite hardware CLA


– Many big gates: N*(N+3)/2 additional gates, max N+1 inputs
+ Laid in parallel: constant 5 latency

⬗ Is there a compromise?
⬗ Reasonable number of small gates?
⬗ Sublinear (doesn’t have to be constant) latency?
⬗ Yes, multi-level CLA: exploits hierarchy to achieve this
29
Carry Lookahead Adder (CLA)
⬗ Calculate “propagate” and “generate” based on A, B
⬗ Not based on carry-in
⬗ Combine with a tree structure

⬗ Take-aways
⬗ Tree gives logarithmic delay
⬗ Reasonable area
[Figure: 4-bit CLA: per-bit blocks compute Gi, Pi from Ai, Bi; first-level blocks combine them into G1-0/P1-0 and G3-2/P3-2 plus C1 and C3; a second-level block produces G3-0/P3-0 and C2; a final block produces C4]

30
Two-Level CLA for 4-bit Adder
⬗ Individual carry equations
⬗ C1 = G0+P0C0, C2 = G1+P1C1, C3 = G2+P2C2, C4 = G3+P3C3
⬗ Infinite hardware CLA equations
⬗ C1 = G0+P0C0
⬗ C2 = G1+P1G0+P1P0C0
⬗ C3 = G2+P2G1+P2P1G0+P2P1P0C0
⬗ C4 = G3+P3G2+P3P2G1+P3P2P1G0+P3P2P1P0C0
⬗ Hierarchical CLA equations
⬗ First level: expand C2 using C1, C4 using C3
⬗ C2 = G1+P1(G0+P0C0) = (G1+P1G0)+(P1P0)C0 = G1-0+P1-0C0
⬗ C4 = G3+P3(G2+P2C2) = (G3+P3G2)+(P3P2)C2 = G3-2+P3-2C2
⬗ Second level: expand C4 using expanded C2
⬗ C4 = G3-2+P3-2(G1-0+P1-0C0) = (G3-2+P3-2G1-0)+(P3-2P1-0)C0
⬗ C4 = G3-0+P3-0C0
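The hierarchical factoring can be sanity-checked exhaustively. A Python sketch (names mine) confirming that the two-level C4 equation agrees with rippling the individual carry equations for every 4-bit input:

```python
from itertools import product

def gp(a, b):
    """Individual generate G = AB and propagate P = A+B for one bit."""
    return a & b, a | b

def c4_hierarchical(a_bits, b_bits, c0):
    """C4 via group G/P: G1-0/P1-0, G3-2/P3-2, then G3-0/P3-0."""
    g = [0] * 4
    p = [0] * 4
    for i in range(4):
        g[i], p[i] = gp(a_bits[i], b_bits[i])
    g10, p10 = g[1] | (p[1] & g[0]), p[1] & p[0]   # group G1-0, P1-0
    g32, p32 = g[3] | (p[3] & g[2]), p[3] & p[2]   # group G3-2, P3-2
    g30, p30 = g32 | (p32 & g10), p32 & p10        # group G3-0, P3-0
    return g30 | (p30 & c0)                        # C4 = G3-0 + P3-0*C0

def c4_ripple(a_bits, b_bits, c0):
    """C4 by chaining Ci+1 = Gi + Pi*Ci bit by bit."""
    c = c0
    for i in range(4):
        g, p = gp(a_bits[i], b_bits[i])
        c = g | (p & c)
    return c

# Exhaustive check over all A, B, C0 combinations
for bits in product((0, 1), repeat=9):
    a, b, c0 = bits[0:4], bits[4:8], bits[8]
    assert c4_hierarchical(a, b, c0) == c4_ripple(a, b, c0)
```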
31
Two-Level CLA for 4-bit Adder
⬗ Top first-level CLA block
⬗ Input: C0, G0, P0, G1, P1
⬗ Output: C1, G1-0, P1-0

⬗ Bottom first-level CLA block
⬗ Input: C2, G2, P2, G3, P3
⬗ Output: C3, G3-2, P3-2

⬗ Second-level CLA block
⬗ Input: C0, G1-0, P1-0, G3-2, P3-2
⬗ Output: C2, G3-0, P3-0

⬗ These 3 blocks are “the same”

⬗ C4 block
⬗ Input: C0, G3-0, P3-0
⬗ Output: C4

32
4-bit CLA Timing: d0
⬗ Which signals are ready at 0 gate delays (d0)?
⬗ C0
⬗ A0, B0
⬗ A1, B1
⬗ A2, B2
⬗ A3, B3

33
4-bit CLA Timing: d1
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi

⬗ New signals ready at d1
⬗ P0 = A0+B0, G0 = A0B0
⬗ P1 = A1+B1, G1 = A1B1
⬗ P2 = A2+B2, G2 = A2B2
⬗ P3 = A3+B3, G3 = A3B3

34
4-bit CLA Timing: d3
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi

⬗ New signals ready at d3
⬗ P1-0 = P1P0
⬗ G1-0 = G1+P1G0
⬗ C1 = G0+P0C0
⬗ P3-2 = P3P2
⬗ G3-2 = G3+P3G2
⬗ C3 is not ready yet
35
4-bit CLA Timing: d5
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi
⬗ d3: P1-0, G1-0, C1, P3-2, G3-2

⬗ New signals ready at d5
⬗ P3-0 = P3-2P1-0
⬗ G3-0 = G3-2+P3-2G1-0
⬗ C2 = G1-0+P1-0C0

36
4-bit CLA Timing: d7
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi
⬗ d3: P1-0, G1-0, C1, P3-2, G3-2
⬗ d5: P3-0, G3-0, C2

⬗ New signals ready at d7
⬗ C3 = G2+P2C2
⬗ C4 = G3-0+P3-0C0

⬗ Si is ready 2 gate delays after Ci

37
Two-Level CLA for 4-bit Adder
⬗ Hardware?
⬗ CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)
⬗ 3 of these
⬗ C4 block: 2 gates, max 2 inputs
⬗ Total: 17 gates, max 3 inputs
⬗ Infinite: 14 gates, max 5 inputs

⬗ Latency?
⬗ 2 for “outer” CLA, 4 for “interior”
⬗ G/P go “up”, C go “down”
⬗ Total: 9 (1 for G/P, 2 for S)
⬗ Infinite: 5

⬗ 2-level: more gates and slower? At 4 bits, yes; the payoff comes at wider adders
38
Individual G & P → Group G & P
⬗ Group G/P: convenient abstraction for multi-level CLA

⬗ Individual carry equations


⬗ C1 = G0+P0C0, C2 = G1+P1C1
⬗ Infinite hardware CLA equations
⬗ C1 = G0+P0C0
⬗ C2 = G1+P1G0+P1P0C0
⬗ Group terms
⬗ C2 = (G1+P1G0)+(P1P0)C0
⬗ C2 = G1-0+P1-0C0
⬗ G1-0, P1-0 are group G & P: treat multiple bits as a single bit
⬗ G1-0: carry-out generated by 1-0 bits
⬗ P1-0: carry-out propagated by 1-0 bits
39
Two-Level CLA for 16-bit Adder
⬗ 4 G/P inputs per level

⬗ Hardware?
⬗ First level: 14 gates * 4 blocks
⬗ Second level: 14 gates * 1 block
⬗ Total: 70 gates
⬗ Largest gate: 5 inputs
⬗ Infinite: 152 gates, 17 inputs

⬗ Latency?
⬗ Total: 9 (1 + 2 + 2 + 2 + 2)
⬗ Infinite: 5

⬗ CLA for a 64-bit adder?
[Figure: four first-level blocks produce G3-0/P3-0, G7-4/P7-4, G11-8/P11-8, G15-12/P15-12 and the interior carries of each group; a second-level block combines them into G15-0/P15-0 and carries C4, C8, C12, C16]

40
16-bit CLA Timing: d0
⬗ What is ready after 0 gate delays?
⬗ C0

41
16-bit CLA Timing: d1
⬗ What is ready after 1 gate delay?
⬗ Individual G/P

42
16-bit CLA Timing: d3
⬗ What is ready after 3 gate delays?
⬗ 1st-level group G/P
⬗ Interior carries of the 1st group
⬗ C1, C2, C3
43
16-bit CLA Timing: d5
⬗ And after 5 gate delays?
⬗ Outer-level group G/P
⬗ Outer-level “interior” carries
⬗ C4, C8, C12
44
16-bit CLA Timing: d7
⬗ And after 7 gate delays?
⬗ C16
⬗ First-level “interior” carries
⬗ C5, C6, C7
⬗ C9, C10, C11
⬗ C13, C14, C15
⬗ Essentially, all remaining carries

⬗ Si is ready 2 gate delays after Ci
⬗ All sum bits ready after 9 delays!
⬗ Same as the 2-level 4-bit CLA
⬗ All 2-level CLAs have a similar delay structure
45
Adders In Real Processors
⬗ Real processors super-optimize their adders
⬗ Ten or so different versions of CLA
⬗ Highly optimized versions of carry-select
⬗ Other gate techniques: carry-skip, conditional-sum
⬗ Sub-gate (transistor) techniques: Manchester carry chain
⬗ Combinations of different techniques
⬗ Alpha 21264 used CLA+CSeA+RippleCA
⬗ Used at different levels

⬗ Even more optimizations for incrementers


⬗ Why?

46
47
Subtraction: Addition’s Tricky Pal
⬗ Sign/magnitude subtraction is mental reverse addition
⬗ 2C subtraction is addition
⬗ How to subtract using an adder?
⬗ sub A B = add A -B
⬗ Negate B before adding (fast negation trick: –B = B’ + 1)
⬗ Isn’t a subtraction then a negation and two additions?
+ No, an adder can implement A+B+1 by setting the carry-in to 1
[Figure: subtractor from an adder: B is inverted (~B); a mux selects B (carry-in 0, add) or ~B (carry-in 1, subtract) into the adder alongside A]
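A behavioral sketch of the trick (Python; the 16-bit width and function names are my choices): invert B and set the adder's carry-in to 1, so the "+1" of negation costs nothing extra:

```python
def add(a, b, cin, n=16):
    """Adder primitive: A + B + carry-in, truncated to n bits."""
    return (a + b + cin) & ((1 << n) - 1)

def sub(a, b, n=16):
    """A - B = A + B' + 1: one adder pass, no separate increment."""
    mask = (1 << n) - 1
    return add(a, b ^ mask, 1, n)  # B' from inverters, carry-in = 1

assert sub(29, 10) == 19
assert sub(10, 29) == (1 << 16) - 19  # -19 in 16-bit two's complement
```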
48
Shifts & Rotates

49
Shift and Rotation Instructions
⬗ Left/right shifts are useful…
⬗ Fast multiplication/division by small constants (next)
⬗ Bit manipulation: extracting and setting individual bits in words

⬗ Right shifts
⬗ Can be logical (shift in 0s) or arithmetic (shift in copies of MSB)
srl 110011, 2 = 001100
sra 110011, 2 = 111100
⬗ Caveat: for negative numbers, sra does not match C-style division by 2^k
⬗ Consider: -1 / 16 = 0, but sra of -1 by 4 bits is still -1

⬗ Rotations are less useful…


⬗ But almost “free” if shifter is there
⬗ MIPS and LC4 have only shifts, x86 has shifts and rotations
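The sra caveat is worth seeing concretely. A sketch using Python ints, whose >> is an arithmetic shift; `c_div` (name mine) mimics C's truncate-toward-zero division:

```python
def c_div(a, b):
    """Integer division truncating toward zero, as C's / does."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

# For non-negative values, sra matches division by a power of two...
assert (44 >> 2) == c_div(44, 4) == 11
# ...but for negatives it rounds toward -infinity instead
assert (-1 >> 4) == -1      # shifting in copies of the sign bit
assert c_div(-1, 16) == 0   # what -1 / 16 gives in C
```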
50
Compiler Opt: Strength Reduction
⬗ Strength reduction: compilers will do this (sort of)
A * 4 = A << 2
A * 5 = (A << 2) + A
A / 8 = A >> 3 (only if A is unsigned)

⬗ Useful for address calculation: all basic data types are 2M in size
int A[100];
&A[N] = A+(N*sizeof(int)) = A+N*4 = A+(N<<2)
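The rewrites above can be spot-checked directly (values are arbitrary; A is kept non-negative so the shift-for-divide rule is valid):

```python
A = 37
assert A * 4 == A << 2
assert A * 5 == (A << 2) + A
assert A // 8 == A >> 3  # holds here because A is unsigned/non-negative

# Address calculation: &A[N] = base + N*sizeof(int) = base + (N << 2)
base, N, sizeof_int = 0x1000, 7, 4
assert base + N * sizeof_int == base + (N << 2)
```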

51
A Simple Shifter
⬗ The simplest 16-bit shifter: can only shift left by 1
⬗ Implement using wires (no logic!)
⬗ Slightly more complicated: can shift left by 1 or 0
⬗ Implement using wires and a multiplexor (mux16_2to1)
[Figure: shift-left-by-1 is pure wiring: output bit i+1 takes input bit i, and bit 0 takes a constant 0; a per-bit 2-to-1 mux then selects between A (shift by 0) and A<<1]

52
Barrel Shifter
⬗ What about shifting left by any amount 0–15?

⬗ 16 consecutive “left-shift-by-1-or-0” blocks?


– Would take too long (how long?)
⬗ Barrel shifter: 4 “shift-left-by-X-or-0” blocks (X = 1,2,4,8)
⬗ What is the delay?
[Figure: A flows through four cascaded stages <<8, <<4, <<2, <<1, enabled by shift[3], shift[2], shift[1], shift[0] respectively, producing O]
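Behaviorally the four stages compose like this. A Python sketch (name and 16-bit width mine); note the loop runs log2(16) = 4 times, not 16:

```python
def barrel_shift_left(a, shift):
    """Barrel shifter: stages <<8, <<4, <<2, <<1 gated by shift[3..0]."""
    mask = 0xFFFF
    for i in (3, 2, 1, 0):
        if (shift >> i) & 1:            # stage enabled by one shift bit
            a = (a << (1 << i)) & mask  # shift by 8, 4, 2, or 1
    return a

assert barrel_shift_left(0x0001, 15) == 0x8000  # 8+4+2+1 = 15
assert barrel_shift_left(0x00FF, 4) == 0x0FF0
assert barrel_shift_left(0x0001, 0) == 0x0001
```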

53
Shifter in Verilog
⬗ Logical shift operators << >>
⬗ performs zero-extension for >>
wire [15:0] a = b << c[3:0];
⬗ Arithmetic shift operator >>>
⬗ performs sign-extension
⬗ requires a signed wire input
wire signed [15:0] b;
wire [15:0] a = b >>> c[3:0];

54
Multiplication

55
3rd Grade: Decimal Multiplication
19 // multiplicand
* 12 // multiplier
38
+ 190
228 // product

⬗ Start with product 0, repeat steps until no multiplier digits


⬗ Multiply multiplicand by least significant multiplier digit
⬗ Add to product
⬗ Shift multiplicand one digit to the left (multiply by 10)
⬗ Shift multiplier one digit to the right (divide by 10)

⬗ Product of N-digit and M-digit numbers may have N+M digits


56
Binary Multiplication: Same Refrain
19 = 010011 // multiplicand
* 12 = 001100 // multiplier
0 = 000000000000
0 = 000000000000
76 = 000001001100
152 = 000010011000
0 = 000000000000
+ 0 = 000000000000
228 = 000011100100 // product

± Smaller base → more steps, each is simpler


⬗ Multiply multiplicand by least significant multiplier digit
+ 0 or 1 → no actual multiplication, add multiplicand or not
⬗ Add to total: we know how to do that
⬗ Shift multiplicand left, multiplier right by one digit
57
Software Multiplication
⬗ Can implement this algorithm in software
⬗ Inputs: md (multiplicand) and mr (multiplier)

int pd = 0; // product
int i = 0;
for (i = 0; i < 16 && mr != 0; i++) {
if (mr & 1) {
pd = pd + md;
}
md = md << 1; // shift left
mr = mr >> 1; // shift right
}
58
Hardware Multiply: Sequential

[Figure: sequential multiplier: 32-bit multiplicand register (shifts left each cycle), 16-bit multiplier register (shifts right each cycle); when the multiplier’s lsb is 1, a 32-bit adder accumulates the multiplicand into the 32-bit product register]

⬗ Control: repeat 16 times


⬗ If least significant bit of multiplier is 1…
⬗ Then add multiplicand to product
⬗ Shift multiplicand left by 1
⬗ Shift multiplier right by 1
59
Hardware Multiply: Combinational
[Figure: combinational multiplier: terms C, C<<1, C<<2, C<<3, C<<4, each gated to 0 by its multiplier bit, summed by a tree of 16-bit adders into product P]
⬗ Multiply by N bits at a time using N adders


⬗ Example: N=5, terms (P=product, C=multiplicand, M=multiplier)
⬗ P = (M[0] ? (C) : 0) + (M[1] ? (C<<1) : 0) +
(M[2] ? (C<<2) : 0) + (M[3] ? (C<<3) : 0) + …
⬗ Arrange like a tree to reduce gate delay critical path
⬗ Delay? N^2 vs N*log N? Not that simple, depends on adder
⬗ Approx “2N” versus “N + log N”; with optimization: O(log N)
60
Partial Sums/Carries
⬗ Observe: carry-outs don’t have to be chained immediately
⬗ Can be saved for later and added back in
00111 = 7
+00011 = 3
00100 // partial sums (sums without carries)
+00110 // partial carries (carries without sums)
01010 = 10

⬗ Partial sums/carries use simple half-adders, not full-adders


+ Aren’t “chained” → can be done in two levels of logic
– Must sum partial sums/carries eventually, and this sum is chained
⬗ d(CS-adder) = 2 + d(normal-adder)
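The partial-sums/partial-carries split (carry-save form) is one gate level of independent per-bit logic. A sketch reproducing the slide's 7 + 3 example (function name mine):

```python
def carry_save(a, b):
    """Split a + b into partial sums (XOR) and partial carries
    (AND, shifted left): no output bit depends on any other bit position."""
    return a ^ b, (a & b) << 1

ps, pc = carry_save(0b00111, 0b00011)   # 7 + 3
assert ps == 0b00100                    # partial sums
assert pc == 0b00110                    # partial carries
assert ps + pc == 10                    # a chained add finishes the job
```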

61
Division

62
4th Grade: Decimal Division
9 // quotient
3 |29 // divisor | dividend
-27
2 // remainder

⬗ Shift divisor left (multiply by 10) until MSB lines up with dividend’s
⬗ Repeat until remaining dividend (remainder) < divisor
⬗ Find largest single digit q such that (q*divisor) ≤ dividend
⬗ Set LSB of quotient to q
⬗ Subtract (q*divisor) from dividend
⬗ Shift quotient left by one digit (multiply by 10)
⬗ Shift divisor right by one digit (divide by 10)

63
Binary Division

1001 = 9
3 |29 = 0011 |011101
-24 = - 011000
5 = 000101
- 3 = - 000011
2 = 000010

64
Binary Division Hardware
⬗ Same as decimal division, except (again)
– More individual steps (base is smaller)
+ Each step is simpler
⬗ Find largest bit q such that (q*divisor) ≤ dividend
⬗ q = 0 or 1
⬗ Subtract (q*divisor) from dividend
⬗ q = 0 or 1 → no actual multiplication, subtract divisor or not

⬗ Complication: finding the largest q such that (q*divisor) ≤ dividend
⬗ How do you know if (1*divisor) ≤ dividend?
⬗ A human can “eyeball” this
⬗ A computer does not have eyeballs
⬗ Subtract and see if the result is negative
⬗ Human can “eyeball” this
⬗ Computer does not have eyeballs
⬗ Subtract and see if result is negative
65
Software Divide Algorithm
⬗ Can implement this algorithm in software
⬗ Inputs: dividend and divisor

remainder = 0;
quotient = 0;
for (int i = 0; i < 32; i++) {
remainder = (remainder << 1) | (dividend >> 31);
if (remainder >= divisor) {
quotient = (quotient << 1) | 1;
remainder = remainder - divisor;
} else {
quotient = (quotient << 1) | 0;
}
dividend = dividend << 1;
}
66
Divide Example
⬗ Input: Divisor = 00011 , Dividend = 11101

Step | Remainder (after shift) | Quotient | Remainder (after subtract) | Dividend
  0  | 00000 | 00000 | 00000 | 11101
  1  | 00001 | 00000 | 00001 | 11010
  2  | 00011 | 00001 | 00000 | 10100
  3  | 00001 | 00010 | 00001 | 01000
  4  | 00010 | 00100 | 00010 | 10000
  5  | 00101 | 01001 | 00010 | 00000

⬗ Result: Quotient: 1001, Remainder: 10

67
Divider Circuit
[Figure: divider datapath: a subtractor compares the remainder register against the divisor; its “≥ 0” result shifts a 0 or 1 into the quotient and decides whether the subtraction result is kept; the remainder shifts left, shifting in the dividend’s msb; the dividend shifts left, shifting in 0]
⬗ N cycles for an N-bit divide


68
Floating Point

69
Floating Point (FP) Numbers
⬗ Floating point numbers: numbers in scientific notation
⬗ Two uses

⬗ Use I: real numbers (numbers with non-zero fractions)


⬗ 3.1415926…
⬗ 2.71828…
⬗ 6.62 * 10^–34

⬗ Use II: really big numbers


⬗ 3.0 * 10^8
⬗ 6.02 * 10^23
⬗ Aside: best not used for currency values
70
Scientific Notation
⬗ Scientific notation:
⬗ Number [S,F,E] = S * F * 2^E
⬗ S: sign
⬗ F: significand (fraction)
⬗ E: exponent
⬗ “Floating point”: binary (decimal) point has different magnitude
+ “Sliding window” of precision using notion of significant digits
⬗ Small numbers very precise, many places after decimal point
⬗ Big numbers are much less so, not all integers representable
⬗ But for those instances you don’t really care anyway
– Caveat: all representations are just approximations
⬗ Sometimes weirdos like 0.9999999 or 1.0000001 come up
+ But good enough for most purposes
71
IEEE 754 Standard Precision/Range
⬗ Single precision: float in C
⬗ 32-bit: 1-bit sign + 8-bit exponent + 23-bit significand
⬗ Range: 2.0 * 10^–38 < N < 2.0 * 10^38
⬗ Precision: ~7 significant (decimal) digits
⬗ Used when exact precision is less important (e.g., 3D games)

⬗ Double precision: double in C


⬗ 64-bit: 1-bit sign + 11-bit exponent + 52-bit significand
⬗ Range: 2.0 * 10^–308 < N < 2.0 * 10^308
⬗ Precision: ~15 significant (decimal) digits
⬗ Used for scientific computations

⬗ Numbers > 10^308 don’t come up in many calculations


⬗ 10^80 ~ number of atoms in the universe
72
73
74
Floating Point is Inexact
⬗ Accuracy problems sometimes get bad
⬗ FP arithmetic not associative: (A+B)+C not same as A+(B+C)
⬗ Addition of big and small numbers (summing many small numbers)
⬗ Subtraction of two big numbers
⬗ Example: what’s (1*10^30 + 1*10^0) – 1*10^30?
⬗ Intuitively: 1*10^0 = 1
⬗ But: (1*10^30 + 1*10^0) – 1*10^30 = (1*10^30 – 1*10^30) = 0
⬗ Reciprocal math: “x/y” versus ”x*(1/y)”
⬗ Reciprocal & multiply is faster than divide, but less precise
⬗ Compilers are generally conservative by default
⬗ GCC flag: –ffast-math (allows assoc. opts, reciprocal math)
⬗ Numerical analysis: field formed around this problem
⬗ Re-formulating algorithms in a way that bounds numerical error
⬗ In your code: never test for equality between FP numbers
⬗ Use something like: if (abs(a-b) < 0.00001) then …
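All three pitfalls above are easy to reproduce with doubles; a Python sketch (values chosen so the rounding is visible):

```python
big, small = 1e30, 1.0
# Adding a small number to a big one loses the small one entirely
assert (big + small) - big == 0.0   # intuitively the answer is 1.0

# FP addition is not associative
a, b, c = 1e30, -1e30, 1.0
assert (a + b) + c == 1.0
assert a + (b + c) == 0.0           # b + c rounds back to -1e30

# Never test FP for equality; use a tolerance instead
x = 0.1 + 0.2
assert x != 0.3
assert abs(x - 0.3) < 1e-9
```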
75
Pentium FDIV Bug
⬗ Pentium shipped in August 1994
⬗ Intel actually knew about the bug in July
⬗ But calculated that delaying the project a month would cost ~$1M
⬗ And that in reality only a dozen or so people would encounter it
⬗ They were right… but one of them took the story to EE times
⬗ By November 1994, the firestorm was in full swing
⬗ IBM said that typical Excel user would encounter bug every month
⬗ Assumed 5K divisions per second around the clock
⬗ People believed the story
⬗ IBM stopped shipping Pentium PCs
⬗ By December 1994, Intel promises full recall
⬗ Total cost: ~$550M
76
Arithmetic Latencies
⬗ Latency in cycles of common arithmetic operations
⬗ Source: Agner Fog, https://www.agner.org/optimize/#manuals
⬗ AMD Ryzen core

Int 32 Int 64 Fp 32 Fp 64
Add/Subtract 1 1 5 5
Multiply 3 3 5 5
Divide 14-30 14-46 8-15 8-15

⬗ Divide is variable latency based on the size of the dividend


⬗ Detect number of leading zeros, then divide
⬗ Why is FP divide faster than integer divide?
77
Summary
⬗ Integer addition
⬗ Most timing-critical operation in datapath
⬗ Hardware != software
⬗ Exploit sub-addition parallelism

⬗ Fast addition
⬗ Carry-select: parallelism in sum
⬗ Multiplication
⬗ Chains and trees of additions
⬗ Division
⬗ Floating point

78
