Fast Computer Arithmetic Guide

This document discusses computer arithmetic and integer addition. It reviews binary integers and two's complement representation. It then explains how integer addition works through half adders, full adders, and ripple-carry adders. A ripple-carry adder chains together full adders but has a delay linear in the number of bits, making it too slow for high-performance computers. Faster adder designs are needed.

COMPUTER

ARCHITECTURE
COMPUTER ARITHMETIC
Dr. Valon Raca
Notes
- Slides adapted and based on sources from University of Pennsylvania (Joseph
Devietti, Benedict Brown, C.J. Taylor, Milo Martin & Amir Roth)

2
This Unit: Arithmetic

⬗ A little review
⬗ Binary + 2s complement
⬗ Ripple-carry addition (RCA)

⬗ Fast integer addition
⬗ Carry-select (CSeA)
⬗ Carry-lookahead adder (CLA)
⬗ Shifters
⬗ Integer multiplication and division
⬗ Floating point arithmetic
3
Readings

⬗ P&H
⬗ Chapter 3
⬗ Appendix B.6

4
The Importance of Fast Arithmetic

[Figure: single-cycle datapath: PC → insn memory → register file (s1, s2, d) → ALU → data memory; critical path T_insn-mem + T_regfile + T_ALU + T_data-mem + T_regfile]


⬗ Addition of two numbers is the most common operation
⬗ Programs use addition frequently
⬗ Loads and stores use addition for address calculation
⬗ Branches use addition to test conditions and calculate targets
⬗ All insns use addition to calculate the default next PC
⬗ Fast addition is critical to high performance
5
Review: Binary Integers
⬗ Computers represent integers in binary (base 2)
3 = 11, 4 = 100, 5 = 101, 30 = 11110
+ Natural since only two values are represented
⬗ Addition, etc. take place as usual (carry the 1, etc.)
17 = 10001
+5 = 101
22 = 10110

⬗ Some old machines use decimal (base 10), with each digit coded in binary


30 = 0011 0000
• Often called BCD (binary-coded decimal)
– Unnatural for digital logic, implementation complicated & slow
+ More precise for currency operations
6
Fixed Width
⬗ On pencil and paper, integers have infinite width

⬗ In hardware, integers have fixed width


⬗ N bits: 16, 32 or 64
⬗ LSB has weight 2^0, MSB has weight 2^(N–1)
⬗ Unsigned range: 0 to 2^N–1

⬗ Numbers ≥ 2^N represented using multiple fixed-width integers


⬗ In software (e.g., Java BigInteger class)
7
What About Negative Integers?
⬗ Option I: sign/magnitude
⬗ Unsigned magnitude plus one bit for sign
10 = 00001010, -10 = 10001010
+ Matches our intuition from “by hand” decimal arithmetic
– Two representations of zero (+0 and –0)
– Addition is difficult
⬗ Symmetric range: –(2^(N–1)–1) to 2^(N–1)–1

⬗ Option II: two’s complement (2C)


⬗ Leading 0s mean positive number, leading 1s negative
10 = 00001010, -10 = 11110110
+ One representation for 0 (all zeroes)
+ Easy addition
⬗ Asymmetric range: –2^(N–1) to 2^(N–1)–1
8
The Tao of 2C
⬗ How did 2C come about?
⬗ “Let’s design a representation that makes addition easy”
⬗ Think of subtracting 10 from 0 by hand with 8-bit numbers
⬗ Have to “borrow” 1s from some imaginary leading 1
0 = 100000000
–10 = –00001010
      011110110

⬗ Now, add the conventional way…

–10 = 11110110
+10 = 00001010
0 = 100000000 (the carry out of the MSB is discarded)
9
Still More On 2C
⬗ What is the interpretation of 2C?
⬗ Same as binary, except MSB has weight –2^(N–1), not 2^(N–1)
⬗ –10 = 11110110 = –2^7+2^6+2^5+2^4+2^2+2^1
+ Extends to any width
⬗ –10 = 110110 = –2^5+2^4+2^2+2^1
⬗ Why? 2^N = 2*2^(N–1), so –2^(N–1) = –2^N+2^(N–1)
⬗ –2^5+2^4+2^2+2^1 = (–2^6+2*2^5)–2^5+2^4+2^2+2^1 = –2^6+2^5+2^4+2^2+2^1
⬗ Equivalent to computing modulo 2^N

⬗ Trick to negating a number quickly: –B = B’ + 1


⬗ –(1) = (0001)’+1 = 1110+1 = 1111 = –1
⬗ –(–1) = (1111)’+1 = 0000+1 = 0001 = 1
⬗ –(0) = (0000)’+1 = 1111+1 = 0000 = 0
⬗ Think about why this works
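The negation trick can be checked mechanically. A minimal Python sketch (the function name and 4-bit width are my choices, not from the slides):

```python
def neg2c(x, n=4):
    """Two's-complement negation: invert all n bits, then add 1."""
    mask = (1 << n) - 1
    return (~x + 1) & mask

# -(0001) = 1110 + 1 = 1111 (i.e., -1)
assert neg2c(0b0001) == 0b1111
# -(1111) = 0000 + 1 = 0001 (i.e., 1)
assert neg2c(0b1111) == 0b0001
# -(0000) = 1111 + 1 = 0000 (the carry out of the MSB is discarded)
assert neg2c(0b0000) == 0b0000
```

The asymmetric range also shows up here: negating 1000 (–8 in 4 bits) yields 1000 again, since +8 is not representable.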
10
Addition

11
1st Grade: Decimal Addition
1
43
+29
72

⬗ Repeat N times
⬗ Add least significant digits and any overflow from previous add
⬗ Carry “overflow” to next addition
⬗ Overflow: any digit other than least significant of sum
⬗ Shift two addends and sum one digit to the right

⬗ Sum of two N-digit numbers can yield an N+1 digit number

12
Binary Addition: Works the Same Way
1 111111
43 = 00101011
+29 = 00011101
72 = 01001000

⬗ Repeat N times
⬗ Add least significant bits and any overflow from previous add
⬗ Carry the overflow to next addition
⬗ Shift two addends and sum one bit to the right
⬗ Sum of two N-bit numbers can yield an N+1 bit number

– More steps (smaller base)


+ Each one is simpler (adding just 1 and 0)
⬗ So simple we can do it in hardware
13
The Half Adder
⬗ How to add two binary integers in hardware?
⬗ Start with adding two bits
⬗ When all else fails ... look at truth table
A B | CO S
0 0 |  0 0
0 1 |  0 1
1 0 |  0 1
1 1 |  1 0

⬗ S = A ^ B
⬗ CO (carry out) = AB
⬗ This is called a half adder (HA): inputs A, B; outputs S, CO
14
The Other Half
⬗ We could chain half adders together, but to do that…
⬗ Need to incorporate a carry out from previous adder
C A B | CO S
0 0 0 |  0 0
0 0 1 |  0 1
0 1 0 |  0 1
0 1 1 |  1 0
1 0 0 |  0 1
1 0 1 |  1 0
1 1 0 |  1 0
1 1 1 |  1 1

⬗ S = C’A’B + C’AB’ + CA’B’ + CAB = C ^ A ^ B
⬗ CO = C’AB + CA’B + CAB’ + CAB = CA + CB + AB
⬗ This is called a full adder (FA): inputs A, B, CI; outputs S, CO
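The sum and carry-out equations can be exercised against the truth table. A small Python sketch (function name mine):

```python
def full_adder(a, b, c):
    """One-bit full adder: S = C ^ A ^ B, CO = CA + CB + AB."""
    s = a ^ b ^ c
    co = (c & a) | (c & b) | (a & b)
    return s, co

# The pair (CO, S) is just the 2-bit count of 1s among the three inputs
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, co = full_adder(a, b, c)
            assert 2 * co + s == a + b + c
```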

15
Ripple-Carry Adder
⬗ N-bit ripple-carry adder
⬗ N 1-bit full adders “chained” together
⬗ CO0 = CI1, CO1 = CI2, etc.
⬗ CI0 = 0
⬗ CO(N–1) is carry-out of entire adder
● CO(N–1) = 1 → “overflow”

⬗ Example: 16-bit ripple-carry adder
[Figure: full adders for bits 0…15 stacked; each FAi takes Ai, Bi and the previous carry, producing Si; the final carry is CO]
⬗ How fast is this?
⬗ How fast is an N-bit ripple-carry adder?
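Behaviorally, the chain looks like this. A Python sketch (names mine) in which bit i's carry-out becomes bit i+1's carry-in:

```python
def ripple_carry_add(a, b, n=16):
    """N-bit ripple-carry adder: N full adders chained, CI0 = 0."""
    carry, s = 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i                      # Si
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # COi = CIi+1
    return s, carry  # final carry is CO(N-1), the "overflow" bit

assert ripple_carry_add(43, 29) == (72, 0)
assert ripple_carry_add(0xFFFF, 1) == (0, 1)  # wraps, carry-out = 1
```

The delay problem is visible in the loop: nothing about bit 15 can be resolved until the carry has passed through all 15 earlier iterations.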

16
Quantifying Adder Delay
⬗ Combinational logic dominated by gate delays
⬗ How many gates between input and output?
⬗ Array storage dominated by wire delays
⬗ Longest delay or critical path is what matters

⬗ Can implement any combinational function in “2” logic levels


⬗ 1 level of AND + 1 level of OR (PLA)
⬗ NOTs are “free”: push to input (DeMorgan’s) or read from latch
⬗ Example: delay(FullAdder) = 2
⬗ d(CarryOut) = delay(AB + AC + BC)
⬗ d(Sum) = d(A ^ B ^ C) = d(AB’C’ + A’BC’ + A’B’C + ABC) = 2
⬗ Note ‘^’ means Xor (just like in C & Java)

⬗ Caveat: “2” assumes gates have few (<8 ?) inputs


17
Ripple-Carry Adder Delay
⬗ Longest path is to CO15 (or S15)
⬗ delay(CO15) = 2 + MAX(d(A15), d(B15), d(CI15))
⬗ d(A15) = d(B15) = 0, d(CI15) = d(CO14)
⬗ d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) … = 32

⬗ In general: d(CO(N–1)) = 2N
– Too slow!
– Linear in number of bits

⬗ Number of gates is also linear

18
Fast Addition

19
Bad idea: a PLA-based Adder?
⬗ If any function can be expressed as two-level logic…
⬗ …why not use a PLA for an entire 8-bit adder?
⬗ Not small
⬗ Approx. 2^16 AND gates, each with 16 inputs
⬗ Then, 8 OR gates, each with up to 2^16 inputs
⬗ Number of gates exponential in bit width!
⬗ Not that fast, either
⬗ An AND gate with 65K inputs != a 2-input AND gate
⬗ Many-input gates are made from trees of, say, 4-input gates
⬗ 2^16-input gates would have at least 8 logic levels
⬗ So, at least 10 levels of logic for a 16-input PLA
⬗ Even so, delay is still logarithmic in number of bits
⬗ There are better (faster, smaller) ways
20
Theme: Hardware != Software
⬗ Hardware can do things that software fundamentally can’t
⬗ And vice versa (of course)

⬗ In hardware, it’s easier to trade resources for latency

⬗ One example of this: speculation


⬗ Slow computation waiting for some slow input?
⬗ Input one of two things?
⬗ Compute with both (slow), choose right one later (fast)

⬗ Does this make sense in software? Not on a uni-processor


⬗ Difference? hardware is parallel, software is sequential
21
Carry-Select Adder
⬗ Carry-select adder
⬗ Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once assuming C8 = 1
⬗ Choose the correct one when CO7 finally becomes available
+ Effectively cuts carry chain in half (breaks critical path)
– But adds muxes
⬗ Delay?
[Figure: an 8-bit adder computes S7-0 and CO7 from A7-0, B7-0 with carry-in 0; two 8-bit adders compute S15-8 from A15-8, B15-8 with carry-in 0 and carry-in 1; CO7 drives a mux that selects the correct S15-8 and CO]
22
Multi-Segment Carry-Select Adder
⬗ Multiple segments
⬗ Example: 5, 5, 6 bit = 16 bit

⬗ Hardware cost
⬗ Still mostly linear (~2x)
⬗ Compute each segment with 0 and 1 carry-in
⬗ Serial mux chain

⬗ Delay
⬗ 5-bit adder (10) + two muxes (4) = 14
[Figure: one 5-bit adder for bits 4-0; duplicated 5-bit adders (carry-in 0 and 1) for bits 9-5; duplicated 6-bit adders for bits 15-10; each earlier segment’s carry selects the next segment’s sum and carry via a mux]
23
Carry-Select Adder Delay
⬗ What is carry-select adder delay (two segment)?
⬗ d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2
⬗ d(CO15) = MAX(2*8, 2*8) + 2 = 18
⬗ In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA)
⬗ What if we cut adder into 4 equal pieces?
⬗ Would it be 2*(N/4) + 2 = 10? Not quite
⬗ d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2
⬗ d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2
⬗ d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2
⬗ d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2
⬗ d(CO15) = 2*4 + 3*2 = 14
⬗ N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2
⬗ 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18
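The formula is easy to tabulate. A sketch evaluating 2*(N/M) + (M–1)*2 (assumes M divides N evenly, as in the slide's examples):

```python
def csea_delay(n, m):
    """Gate delay of an N-bit carry-select adder in M equal segments:
    ripple through one segment, then M-1 serial muxes at 2 delays each."""
    return 2 * (n // m) + (m - 1) * 2

assert csea_delay(16, 1) == 32   # degenerate case: plain ripple-carry
assert csea_delay(16, 2) == 18
assert csea_delay(16, 4) == 14   # the minimum for N = 16
assert csea_delay(16, 8) == 18
```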
24
16b Carry-Select Adder Delay
[Chart: gate delay of a 16-bit carry-select adder, 2*(16/M) + (M–1)*2, versus number of segments M = 1…16; delay falls from 32 at M = 1 to a minimum of 14 at M = 4, then rises again as the serial mux chain dominates]

25
Another Option: Carry Lookahead
⬗ Is carry-select adder as fast as we can go?
⬗ Nope!

⬗ Another approach to using additional resources


⬗ Instead of redundantly computing sums assuming different
carries
⬗ Use redundancy to compute carries more quickly
⬗ This approach is called carry lookahead (CLA)

26
A & B → Generate & Propagate
⬗ Let’s look at the single-bit carry-out function
⬗ CO = AB+AC+BC
⬗ Factor out terms that use only A and B (available immediately)
⬗ CO = (AB)+(A+B)C
⬗ (AB): generates carry-out regardless of incoming C → rename to G
⬗ (A+B): propagates incoming C → rename to P
⬗ CO = G+PC
27
Infinite Hardware CLA
⬗ Can expand C1…N in terms of G’s, P’s, and C0
⬗ Example: C16
⬗ C16 = G15+P15C15
⬗ C16 = G15+P15(G14+P14C14)
⬗ C16 = G15+P15G14+ … + P15P14…P2P1G0 + P15P14…P2P1P0C0
⬗ Similar expansions for C15, C14, etc.

⬗ How much does this cost?


⬗ CN needs: N ANDs + 1 OR, largest with N+1 inputs
⬗ CN…C1 need: N*(N+1)/2 ANDs + N ORs, max N+1 inputs
⬗ N=16: 152 total gates, max input 17
⬗ N=64: 2144 total gates, max input 65
⬗ And how fast is it really?
– Not that fast, unfortunately, 17-input gates are really slow
28
Compromise: Multi-Level CLA
⬗ Ripple carry
+ Few small gates: 2N gates, max 3 inputs
– Laid in series: 2N (linear) latency

⬗ Infinite hardware CLA


– Many big gates: N*(N+3)/2 additional gates, max N+1 inputs
+ Laid in parallel: constant 5 latency

⬗ Is there a compromise?
⬗ Reasonable number of small gates?
⬗ Sublinear (doesn’t have to be constant) latency?
⬗ Yes, multi-level CLA: exploits hierarchy to achieve this
29
Carry Lookahead Adder (CLA)
⬗ Calculate “propagate” and “generate” based on A, B
⬗ Not based on carry-in
⬗ Combine with a tree structure

⬗ Take-aways
⬗ Tree gives logarithmic delay
⬗ Reasonable area
[Figure: 4-bit CLA: per-bit blocks compute Gi, Pi from Ai, Bi; first-level blocks combine them into G1-0/P1-0 and G3-2/P3-2 plus C1 and C3; a second-level block produces G3-0/P3-0 and C2; a final block produces C4]

30
Two-Level CLA for 4-bit Adder
⬗ Individual carry equations
⬗ C1 = G0+P0C0, C2 = G1+P1C1, C3 = G2+P2C2, C4 = G3+P3C3
⬗ Infinite hardware CLA equations
⬗ C1 = G0+P0C0
⬗ C2 = G1+P1G0+P1P0C0
⬗ C3 = G2+P2G1+P2P1G0+P2P1P0C0
⬗ C4 = G3+P3G2+P3P2G1+P3P2P1G0+P3P2P1P0C0
⬗ Hierarchical CLA equations
⬗ First level: expand C2 using C1, C4 using C3
⬗ C2 = G1+P1(G0+P0C0) = (G1+P1G0)+(P1P0)C0 = G1-0+P1-0C0
⬗ C4 = G3+P3(G2+P2C2) = (G3+P3G2)+(P3P2)C2 = G3-2+P3-2C2
⬗ Second level: expand C4 using expanded C2
⬗ C4 = G3-2+P3-2(G1-0+P1-0C0) = (G3-2+P3-2G1-0)+(P3-2P1-0)C0
⬗ C4 = G3-0+P3-0C0
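The hierarchical factoring can be sanity-checked exhaustively. A Python sketch (names mine) confirming that the two-level C4 equation agrees with rippling the individual carry equations for every 4-bit input:

```python
from itertools import product

def gp(a, b):
    """Individual generate G = AB and propagate P = A+B for one bit."""
    return a & b, a | b

def c4_hierarchical(a_bits, b_bits, c0):
    """C4 via group G/P: G1-0/P1-0, G3-2/P3-2, then G3-0/P3-0."""
    g = [0] * 4
    p = [0] * 4
    for i in range(4):
        g[i], p[i] = gp(a_bits[i], b_bits[i])
    g10, p10 = g[1] | (p[1] & g[0]), p[1] & p[0]   # group G1-0, P1-0
    g32, p32 = g[3] | (p[3] & g[2]), p[3] & p[2]   # group G3-2, P3-2
    g30, p30 = g32 | (p32 & g10), p32 & p10        # group G3-0, P3-0
    return g30 | (p30 & c0)                        # C4 = G3-0 + P3-0*C0

def c4_ripple(a_bits, b_bits, c0):
    """C4 by chaining Ci+1 = Gi + Pi*Ci bit by bit."""
    c = c0
    for i in range(4):
        g, p = gp(a_bits[i], b_bits[i])
        c = g | (p & c)
    return c

# Exhaustive check over all A, B, C0 combinations
for bits in product((0, 1), repeat=9):
    a, b, c0 = bits[0:4], bits[4:8], bits[8]
    assert c4_hierarchical(a, b, c0) == c4_ripple(a, b, c0)
```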
31
Two-Level CLA for 4-bit Adder
⬗ Top first-level CLA block
⬗ Input: C0, G0, P0, G1, P1
⬗ Output: C1, G1-0, P1-0

⬗ Bottom first-level CLA block
⬗ Input: C2, G2, P2, G3, P3
⬗ Output: C3, G3-2, P3-2

⬗ Second-level CLA block
⬗ Input: C0, G1-0, P1-0, G3-2, P3-2
⬗ Output: C2, G3-0, P3-0

⬗ These 3 blocks are “the same”

⬗ C4 block
⬗ Input: C0, G3-0, P3-0
⬗ Output: C4

32
4-bit CLA Timing: d0
⬗ Which signals are ready at 0 gate delays (d0)?
⬗ C0
⬗ A0, B0
⬗ A1, B1
⬗ A2, B2
⬗ A3, B3

33
4-bit CLA Timing: d1
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi

⬗ New signals ready at d1
⬗ P0 = A0+B0, G0 = A0B0
⬗ P1 = A1+B1, G1 = A1B1
⬗ P2 = A2+B2, G2 = A2B2
⬗ P3 = A3+B3, G3 = A3B3

34
4-bit CLA Timing: d3
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi

⬗ New signals ready at d3
⬗ P1-0 = P1P0
⬗ G1-0 = G1+P1G0
⬗ C1 = G0+P0C0
⬗ P3-2 = P3P2
⬗ G3-2 = G3+P3G2
⬗ C3 is not ready yet
35
4-bit CLA Timing: d5
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi
⬗ d3: P1-0, G1-0, C1, P3-2, G3-2

⬗ New signals ready at d5
⬗ P3-0 = P3-2P1-0
⬗ G3-0 = G3-2+P3-2G1-0
⬗ C2 = G1-0+P1-0C0

36
4-bit CLA Timing: d7
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi
⬗ d3: P1-0, G1-0, C1, P3-2, G3-2
⬗ d5: P3-0, G3-0, C2

⬗ New signals ready at d7
⬗ C3 = G2+P2C2
⬗ C4 = G3-0+P3-0C0

⬗ Si is ready 2 gate delays after Ci

37
Two-Level CLA for 4-bit Adder
⬗ Hardware?
⬗ CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)
⬗ 3 of these
⬗ C4 block: 2 gates, max 2 inputs
⬗ Total: 17 gates, max 3 inputs
⬗ Infinite: 14 gates, max 5 inputs

⬗ Latency?
⬗ 2 for “outer” CLA, 4 for “interior”
⬗ G/P go “up”, C go “down”
⬗ Total: 9 (1 for G/P, 2 for S)
⬗ Infinite: 5

⬗ 2-level: more gates and slower? At 4 bits, yes; the payoff comes at wider adders
38
Individual G & P → Group G & P
⬗ Group G/P: convenient abstraction for multi-level CLA

⬗ Individual carry equations


⬗ C1 = G0+P0C0, C2 = G1+P1C1
⬗ Infinite hardware CLA equations
⬗ C1 = G0+P0C0
⬗ C2 = G1+P1G0+P1P0C0
⬗ Group terms
⬗ C2 = (G1+P1G0)+(P1P0)C0
⬗ C2 = G1-0+P1-0C0
⬗ G1-0, P1-0 are group G & P: treat multiple bits as a single bit
⬗ G1-0: carry-out generated by 1-0 bits
⬗ P1-0: carry-out propagated by 1-0 bits
39
Two-Level CLA for 16-bit Adder
⬗ 4 G/P inputs per level

⬗ Hardware?
⬗ First level: 14 gates * 4 blocks
⬗ Second level: 14 gates * 1 block
⬗ Total: 70 gates
⬗ Largest gate: 5 inputs
⬗ Infinite: 152 gates, 17 inputs

⬗ Latency?
⬗ Total: 9 (1 + 2 + 2 + 2 + 2)
⬗ Infinite: 5

⬗ CLA for a 64-bit adder?
[Figure: four first-level blocks produce G3-0/P3-0, G7-4/P7-4, G11-8/P11-8, G15-12/P15-12 and the interior carries of each group; a second-level block combines them into G15-0/P15-0 and carries C4, C8, C12, C16]

40
16-bit CLA Timing: d0
⬗ What is ready after 0 gate delays?
⬗ C0

41
16-bit CLA Timing: d1
⬗ What is ready after 1 gate delay?
⬗ Individual G/P

42
16-bit CLA Timing: d3
⬗ What is ready after 3 gate delays?
⬗ 1st-level group G/P
⬗ Interior carries of the 1st group
⬗ C1, C2, C3
43
16-bit CLA Timing: d5
⬗ And after 5 gate delays?
⬗ Outer-level group G/P
⬗ Outer-level “interior” carries
⬗ C4, C8, C12
44
16-bit CLA Timing: d7
⬗ And after 7 gate delays?
⬗ C16
⬗ First-level “interior” carries
⬗ C5, C6, C7
⬗ C9, C10, C11
⬗ C13, C14, C15
⬗ Essentially, all remaining carries

⬗ Si is ready 2 gate delays after Ci
⬗ All sum bits ready after 9 delays!
⬗ Same as the 2-level 4-bit CLA
⬗ All 2-level CLAs have a similar delay structure
45
Adders In Real Processors
⬗ Real processors super-optimize their adders
⬗ Ten or so different versions of CLA
⬗ Highly optimized versions of carry-select
⬗ Other gate techniques: carry-skip, conditional-sum
⬗ Sub-gate (transistor) techniques: Manchester carry chain
⬗ Combinations of different techniques
⬗ Alpha 21264 used CLA+CSeA+RippleCA
⬗ Used at different levels

⬗ Even more optimizations for incrementers


⬗ Why?

46
47
Subtraction: Addition’s Tricky Pal
⬗ Sign/magnitude subtraction is mental reverse addition
⬗ 2C subtraction is addition
⬗ How to subtract using an adder?
⬗ sub A B = add A -B
⬗ Negate B before adding (fast negation trick: –B = B’ + 1)
⬗ Isn’t a subtraction then a negation and two additions?
+ No, an adder can implement A+B+1 by setting the carry-in to 1
[Figure: subtractor from an adder: B is inverted (~B); a mux selects B (carry-in 0, add) or ~B (carry-in 1, subtract) into the adder alongside A]
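A behavioral sketch of the trick (Python; the 16-bit width and function names are my choices): invert B and set the adder's carry-in to 1, so the "+1" of negation costs nothing extra:

```python
def add(a, b, cin, n=16):
    """Adder primitive: A + B + carry-in, truncated to n bits."""
    return (a + b + cin) & ((1 << n) - 1)

def sub(a, b, n=16):
    """A - B = A + B' + 1: one adder pass, no separate increment."""
    mask = (1 << n) - 1
    return add(a, b ^ mask, 1, n)  # B' from inverters, carry-in = 1

assert sub(29, 10) == 19
assert sub(10, 29) == (1 << 16) - 19  # -19 in 16-bit two's complement
```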
48
Shifts & Rotates

49
Shift and Rotation Instructions
⬗ Left/right shifts are useful…
⬗ Fast multiplication/division by small constants (next)
⬗ Bit manipulation: extracting and setting individual bits in words

⬗ Right shifts
⬗ Can be logical (shift in 0s) or arithmetic (shift in copies of MSB)
srl 110011, 2 = 001100
sra 110011, 2 = 111100
⬗ Caveat: for negative numbers, sra does not match C-style division by 2^k
⬗ Consider: -1 / 16 = 0, but sra of -1 by 4 bits is still -1

⬗ Rotations are less useful…


⬗ But almost “free” if shifter is there
⬗ MIPS and LC4 have only shifts, x86 has shifts and rotations
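The sra caveat is worth seeing concretely. A sketch using Python ints, whose >> is an arithmetic shift; `c_div` (name mine) mimics C's truncate-toward-zero division:

```python
def c_div(a, b):
    """Integer division truncating toward zero, as C's / does."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

# For non-negative values, sra matches division by a power of two...
assert (44 >> 2) == c_div(44, 4) == 11
# ...but for negatives it rounds toward -infinity instead
assert (-1 >> 4) == -1      # shifting in copies of the sign bit
assert c_div(-1, 16) == 0   # what -1 / 16 gives in C
```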
50
Compiler Opt: Strength Reduction
⬗ Strength reduction: compilers will do this (sort of)
A * 4 = A << 2
A * 5 = (A << 2) + A
A / 8 = A >> 3 (only if A is unsigned)

⬗ Useful for address calculation: all basic data types are 2M in size
int A[100];
&A[N] = A+(N*sizeof(int)) = A+N*4 = A+(N<<2)
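The rewrites above can be spot-checked directly (values are arbitrary; A is kept non-negative so the shift-for-divide rule is valid):

```python
A = 37
assert A * 4 == A << 2
assert A * 5 == (A << 2) + A
assert A // 8 == A >> 3  # holds here because A is unsigned/non-negative

# Address calculation: &A[N] = base + N*sizeof(int) = base + (N << 2)
base, N, sizeof_int = 0x1000, 7, 4
assert base + N * sizeof_int == base + (N << 2)
```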

51
A Simple Shifter
⬗ The simplest 16-bit shifter: can only shift left by 1
⬗ Implement using wires (no logic!)
⬗ Slightly more complicated: can shift left by 1 or 0
⬗ Implement using wires and a multiplexor (mux16_2to1)
[Figure: shift-left-by-1 is pure wiring: output bit i+1 takes input bit i, and bit 0 takes a constant 0; a per-bit 2-to-1 mux then selects between A (shift by 0) and A<<1]

52
Barrel Shifter
⬗ What about shifting left by any amount 0–15?

⬗ 16 consecutive “left-shift-by-1-or-0” blocks?


– Would take too long (how long?)
⬗ Barrel shifter: 4 “shift-left-by-X-or-0” blocks (X = 1,2,4,8)
⬗ What is the delay?
[Figure: A flows through four cascaded stages <<8, <<4, <<2, <<1, enabled by shift[3], shift[2], shift[1], shift[0] respectively, producing O]
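Behaviorally the four stages compose like this. A Python sketch (name and 16-bit width mine); note the loop runs log2(16) = 4 times, not 16:

```python
def barrel_shift_left(a, shift):
    """Barrel shifter: stages <<8, <<4, <<2, <<1 gated by shift[3..0]."""
    mask = 0xFFFF
    for i in (3, 2, 1, 0):
        if (shift >> i) & 1:            # stage enabled by one shift bit
            a = (a << (1 << i)) & mask  # shift by 8, 4, 2, or 1
    return a

assert barrel_shift_left(0x0001, 15) == 0x8000  # 8+4+2+1 = 15
assert barrel_shift_left(0x00FF, 4) == 0x0FF0
assert barrel_shift_left(0x0001, 0) == 0x0001
```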

53
Shifter in Verilog
⬗ Logical shift operators << >>
⬗ performs zero-extension for >>
wire [15:0] a = b << c[3:0];
⬗ Arithmetic shift operator >>>
⬗ performs sign-extension
⬗ requires a signed wire input
wire signed [15:0] b;
wire [15:0] a = b >>> c[3:0];

54
Multiplication

55
3rd Grade: Decimal Multiplication
19 // multiplicand
* 12 // multiplier
38
+ 190
228 // product

⬗ Start with product 0, repeat steps until no multiplier digits


⬗ Multiply multiplicand by least significant multiplier digit
⬗ Add to product
⬗ Shift multiplicand one digit to the left (multiply by 10)
⬗ Shift multiplier one digit to the right (divide by 10)

⬗ Product of N-digit and M-digit numbers may have N+M digits


56
Binary Multiplication: Same Refrain
19 = 010011 // multiplicand
* 12 = 001100 // multiplier
0 = 000000000000
0 = 000000000000
76 = 000001001100
152 = 000010011000
0 = 000000000000
+ 0 = 000000000000
228 = 000011100100 // product

± Smaller base → more steps, each is simpler


⬗ Multiply multiplicand by least significant multiplier digit
+ 0 or 1 → no actual multiplication, add multiplicand or not
⬗ Add to total: we know how to do that
⬗ Shift multiplicand left, multiplier right by one digit
57
Software Multiplication
⬗ Can implement this algorithm in software
⬗ Inputs: md (multiplicand) and mr (multiplier)

int pd = 0; // product
int i = 0;
for (i = 0; i < 16 && mr != 0; i++) {
if (mr & 1) {
pd = pd + md;
}
md = md << 1; // shift left
mr = mr >> 1; // shift right
}
58
Hardware Multiply: Sequential

[Figure: sequential multiplier: 32-bit multiplicand register (shifts left each cycle), 16-bit multiplier register (shifts right each cycle); when the multiplier’s lsb is 1, a 32-bit adder accumulates the multiplicand into the 32-bit product register]

⬗ Control: repeat 16 times


⬗ If least significant bit of multiplier is 1…
⬗ Then add multiplicand to product
⬗ Shift multiplicand left by 1
⬗ Shift multiplier right by 1
59
Hardware Multiply: Combinational
[Figure: combinational multiplier: terms C, C<<1, C<<2, C<<3, C<<4, each gated to 0 by its multiplier bit, summed by a tree of 16-bit adders into product P]
⬗ Multiply by N bits at a time using N adders


⬗ Example: N=5, terms (P=product, C=multiplicand, M=multiplier)
⬗ P = (M[0] ? (C) : 0) + (M[1] ? (C<<1) : 0) +
(M[2] ? (C<<2) : 0) + (M[3] ? (C<<3) : 0) + …
⬗ Arrange like a tree to reduce gate delay critical path
⬗ Delay? N^2 vs N*log N? Not that simple, depends on adder
⬗ Approx “2N” versus “N + log N”; with optimization: O(log N)
60
Partial Sums/Carries
⬗ Observe: carry-outs don’t have to be chained immediately
⬗ Can be saved for later and added back in
00111 = 7
+00011 = 3
00100 // partial sums (sums without carries)
+00110 // partial carries (carries without sums)
01010 = 10

⬗ Partial sums/carries use simple half-adders, not full-adders


+ Aren’t “chained” → can be done in two levels of logic
– Must sum partial sums/carries eventually, and this sum is chained
⬗ d(CS-adder) = 2 + d(normal-adder)
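The partial-sums/partial-carries split (carry-save form) is one gate level of independent per-bit logic. A sketch reproducing the slide's 7 + 3 example (function name mine):

```python
def carry_save(a, b):
    """Split a + b into partial sums (XOR) and partial carries
    (AND, shifted left): no output bit depends on any other bit position."""
    return a ^ b, (a & b) << 1

ps, pc = carry_save(0b00111, 0b00011)   # 7 + 3
assert ps == 0b00100                    # partial sums
assert pc == 0b00110                    # partial carries
assert ps + pc == 10                    # a chained add finishes the job
```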

61
Division

62
4th Grade: Decimal Division
9 // quotient
3 |29 // divisor | dividend
-27
2 // remainder

⬗ Shift divisor left (multiply by 10) until MSB lines up with dividend’s
⬗ Repeat until remaining dividend (remainder) < divisor
⬗ Find largest single digit q such that (q*divisor) ≤ dividend
⬗ Set LSB of quotient to q
⬗ Subtract (q*divisor) from dividend
⬗ Shift quotient left by one digit (multiply by 10)
⬗ Shift divisor right by one digit (divide by 10)

63
Binary Division

1001 = 9
3 |29 = 0011 |011101
-24 = - 011000
5 = 000101
- 3 = - 000011
2 = 000010

64
Binary Division Hardware
⬗ Same as decimal division, except (again)
– More individual steps (base is smaller)
+ Each step is simpler
⬗ Find largest bit q such that (q*divisor) ≤ dividend
⬗ q = 0 or 1
⬗ Subtract (q*divisor) from dividend
⬗ q = 0 or 1 → no actual multiplication, subtract divisor or not

⬗ Complication: finding the largest q such that (q*divisor) ≤ dividend
⬗ How do you know if (1*divisor) ≤ dividend?
⬗ A human can “eyeball” this
⬗ A computer does not have eyeballs
⬗ Subtract and see if the result is negative
⬗ Human can “eyeball” this
⬗ Computer does not have eyeballs
⬗ Subtract and see if result is negative
65
Software Divide Algorithm
⬗ Can implement this algorithm in software
⬗ Inputs: dividend and divisor

remainder = 0;
quotient = 0;
for (int i = 0; i < 32; i++) {
remainder = (remainder << 1) | (dividend >> 31);
if (remainder >= divisor) {
quotient = (quotient << 1) | 1;
remainder = remainder - divisor;
} else {
quotient = (quotient << 1) | 0;
}
dividend = dividend << 1;
}
66
Divide Example
⬗ Input: Divisor = 00011 , Dividend = 11101

Step | Remainder (after shift) | Quotient | Remainder (after subtract) | Dividend
  0  | 00000 | 00000 | 00000 | 11101
  1  | 00001 | 00000 | 00001 | 11010
  2  | 00011 | 00001 | 00000 | 10100
  3  | 00001 | 00010 | 00001 | 01000
  4  | 00010 | 00100 | 00010 | 10000
  5  | 00101 | 01001 | 00010 | 00000

⬗ Result: Quotient: 1001, Remainder: 10

67
Divider Circuit
[Figure: divider datapath: a subtractor compares the remainder register against the divisor; its “≥ 0” result shifts a 0 or 1 into the quotient and decides whether the subtraction result is kept; the remainder shifts left, shifting in the dividend’s msb; the dividend shifts left, shifting in 0]
⬗ N cycles for an N-bit divide


68
Floating Point

69
Floating Point (FP) Numbers
⬗ Floating point numbers: numbers in scientific notation
⬗ Two uses

⬗ Use I: real numbers (numbers with non-zero fractions)


⬗ 3.1415926…
⬗ 2.71828…
⬗ 6.62 * 10^–34

⬗ Use II: really big numbers


⬗ 3.0 * 10^8
⬗ 6.02 * 10^23
⬗ Aside: best not used for currency values
70
Scientific Notation
⬗ Scientific notation:
⬗ Number [S,F,E] = S * F * 2^E
⬗ S: sign
⬗ F: significand (fraction)
⬗ E: exponent
⬗ “Floating point”: binary (decimal) point has different magnitude
+ “Sliding window” of precision using notion of significant digits
⬗ Small numbers very precise, many places after decimal point
⬗ Big numbers are much less so, not all integers representable
⬗ But for those instances you don’t really care anyway
– Caveat: all representations are just approximations
⬗ Sometimes weirdos like 0.9999999 or 1.0000001 come up
+ But good enough for most purposes
71
IEEE 754 Standard Precision/Range
⬗ Single precision: float in C
⬗ 32-bit: 1-bit sign + 8-bit exponent + 23-bit significand
⬗ Range: 2.0 * 10^–38 < N < 2.0 * 10^38
⬗ Precision: ~7 significant (decimal) digits
⬗ Used when exact precision is less important (e.g., 3D games)

⬗ Double precision: double in C


⬗ 64-bit: 1-bit sign + 11-bit exponent + 52-bit significand
⬗ Range: 2.0 * 10^–308 < N < 2.0 * 10^308
⬗ Precision: ~15 significant (decimal) digits
⬗ Used for scientific computations

⬗ Numbers > 10^308 don’t come up in many calculations


⬗ 10^80 ~ number of atoms in the universe
72
73
74
Floating Point is Inexact
⬗ Accuracy problems sometimes get bad
⬗ FP arithmetic not associative: (A+B)+C not same as A+(B+C)
⬗ Addition of big and small numbers (summing many small numbers)
⬗ Subtraction of two big numbers
⬗ Example: what’s (1*10^30 + 1*10^0) – 1*10^30?
⬗ Intuitively: 1*10^0 = 1
⬗ But: (1*10^30 + 1*10^0) – 1*10^30 = (1*10^30 – 1*10^30) = 0
⬗ Reciprocal math: “x/y” versus ”x*(1/y)”
⬗ Reciprocal & multiply is faster than divide, but less precise
⬗ Compilers are generally conservative by default
⬗ GCC flag: –ffast-math (allows assoc. opts, reciprocal math)
⬗ Numerical analysis: field formed around this problem
⬗ Re-formulating algorithms in a way that bounds numerical error
⬗ In your code: never test for equality between FP numbers
⬗ Use something like: if (abs(a-b) < 0.00001) then …
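All three pitfalls above are easy to reproduce with doubles; a Python sketch (values chosen so the rounding is visible):

```python
big, small = 1e30, 1.0
# Adding a small number to a big one loses the small one entirely
assert (big + small) - big == 0.0   # intuitively the answer is 1.0

# FP addition is not associative
a, b, c = 1e30, -1e30, 1.0
assert (a + b) + c == 1.0
assert a + (b + c) == 0.0           # b + c rounds back to -1e30

# Never test FP for equality; use a tolerance instead
x = 0.1 + 0.2
assert x != 0.3
assert abs(x - 0.3) < 1e-9
```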
75
Pentium FDIV Bug
⬗ Pentium shipped in August 1994
⬗ Intel actually knew about the bug in July
⬗ But calculated that delaying the project a month would cost ~$1M
⬗ And that in reality only a dozen or so people would encounter it
⬗ They were right… but one of them took the story to EE times
⬗ By November 1994, the firestorm was in full swing
⬗ IBM said that typical Excel user would encounter bug every month
⬗ Assumed 5K divisions per second around the clock
⬗ People believed the story
⬗ IBM stopped shipping Pentium PCs
⬗ By December 1994, Intel promises full recall
⬗ Total cost: ~$550M
76
Arithmetic Latencies
⬗ Latency in cycles of common arithmetic operations
⬗ Source: Agner Fog, https://www.agner.org/optimize/#manuals
⬗ AMD Ryzen core

Int 32 Int 64 Fp 32 Fp 64
Add/Subtract 1 1 5 5
Multiply 3 3 5 5
Divide 14-30 14-46 8-15 8-15

⬗ Divide is variable latency based on the size of the dividend


⬗ Detect number of leading zeros, then divide
⬗ Why is FP divide faster than integer divide?
77
Summary
⬗ Integer addition
⬗ Most timing-critical operation in datapath
⬗ Hardware != software
⬗ Exploit sub-addition parallelism

⬗ Fast addition
⬗ Carry-select: parallelism in sum
⬗ Multiplication
⬗ Chains and trees of additions
⬗ Division
⬗ Floating point

78
