Digital Engineering
Fall 2024
Lecture 02 - Floating Point Arithmetic
Instructor: Dr. Tarek Abdul Hamid
The World is Not Just Integers
Programming languages support numbers with fractional parts
These are called floating-point numbers
Examples:
3.14159265… (π)
2.71828… (e)
0.000000001 or $1.0 \times 10^{-9}$ (seconds in a nanosecond)
86,400,000,000,000 or $8.64 \times 10^{13}$ (nanoseconds in a day)
The last number is a large integer that cannot fit in a 32-bit integer
We use scientific notation to represent
Very small numbers (e.g. $1.0 \times 10^{-9}$)
Very large numbers (e.g. $8.64 \times 10^{13}$)
Scientific notation: $\pm d.f_1 f_2 f_3 f_4 \ldots \times 10^{\pm e_1 e_2 e_3}$
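As a quick check of the claim about 32-bit integers, here is a minimal Python sketch. The names INT32_MAX, NS_PER_DAY and fits_in_int32 are illustrative and not part of the lecture.

```python
# Largest value of a signed 32-bit (two's complement) integer.
INT32_MAX = 2**31 - 1

# Nanoseconds in one day: 24 h * 60 min * 60 s * 10^9 ns.
NS_PER_DAY = 24 * 60 * 60 * 10**9


def fits_in_int32(n: int) -> bool:
    """Return True if n is representable as a signed 32-bit integer."""
    return -2**31 <= n <= INT32_MAX


print(NS_PER_DAY)                 # 86400000000000
print(fits_in_int32(NS_PER_DAY))  # False
print(f"{NS_PER_DAY:.2e}")        # 8.64e+13
print(f"{1e-9:.1e}")              # 1.0e-09
```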
2 Dr. Tarek Abdul Hamid Digital Engineering
Floating-Point Numbers
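Examples of floating-point numbers in base 10 …
$5.341 \times 10^{3}$, $0.05341 \times 10^{5}$, $-2.013 \times 10^{-1}$, $-201.3 \times 10^{-3}$
The point in these numbers is the decimal point
Examples of floating-point numbers in base 2 …
$1.00101 \times 2^{23}$, $0.0100101 \times 2^{25}$, $-1.101101 \times 2^{-3}$, $-1101.101 \times 2^{-6}$
The point in these numbers is the binary point
Exponents are kept in decimal for clarity
The binary number $(1101.101)_2 = 2^{3} + 2^{2} + 2^{0} + 2^{-1} + 2^{-3} = 13.625$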
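The positional weights used in this expansion can be sketched in Python as follows. The helper name binary_to_decimal is hypothetical and handles unsigned numerals only.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate an unsigned binary numeral such as '1101.101'."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Digits left of the point have weights 2^0, 2^1, ... from right to left.
    for i, b in enumerate(reversed(whole)):
        value += int(b) * 2.0**i
    # Digits right of the point have weights 2^-1, 2^-2, ...
    for i, b in enumerate(frac, start=1):
        value += int(b) * 2.0**-i
    return value


print(binary_to_decimal("1101.101"))  # 13.625
```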
Floating-point numbers should be normalized
Exactly one non-zero digit should appear before the point
In a decimal number, this digit can be from 1 to 9
In a binary number, this digit should be 1
Normalized FP Numbers: $5.341 \times 10^{3}$ and $-1.101101 \times 2^{-3}$
NOT Normalized: $0.05341 \times 10^{5}$ and $-1101.101 \times 2^{-6}$
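One way to see binary normalization in practice is with Python's math.frexp, which returns a mantissa in [0.5, 1) rather than [1, 2). The sketch below rescales it by one bit position. The name normalize_binary is illustrative.

```python
import math


def normalize_binary(x: float) -> tuple:
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= |significand| < 2. Zero has no normalized form."""
    if x == 0:
        raise ValueError("zero cannot be normalized")
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= |m| < 1
    return m * 2, e - 1    # shift the point one place to the right


# -1101.101 x 2^-6 in binary is -13.625 / 64.
print(normalize_binary(-13.625 * 2**-6))  # (-1.703125, -3)
```

The result matches the normalized form above: $(1.101101)_2 = 1.703125$ with exponent -3.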
Floating-Point Representation
A floating-point number is represented by the triple (S, E, F)
S is the Sign bit (0 is positive and 1 is negative)
Representation is called sign and magnitude
E is the Exponent field (signed)
Very large numbers have large positive exponents
Very small close-to-zero numbers have negative exponents
More bits in the exponent field increase the range of values
F is the Fraction field (fraction after binary point)
More bits in the fraction field improve the precision of FP numbers
Field layout: S | Exponent | Fraction
Value of a floating-point number = $(-1)^{S} \times \mathrm{val}(F) \times 2^{\mathrm{val}(E)}$
IEEE 754 Floating-Point Standard
Found in virtually every computer invented since 1980
Simplified porting of floating-point numbers
Unified the development of floating-point algorithms
Increased the accuracy of floating-point numbers
Single Precision Floating Point Numbers (32 bits)
1-bit sign + 8-bit exponent + 23-bit fraction
Field layout: S (1 bit) | Exponent (8 bits) | Fraction (23 bits)
Double Precision Floating Point Numbers (64 bits)
1-bit sign + 11-bit exponent + 52-bit fraction
Field layout: S (1 bit) | Exponent (11 bits) | Fraction (52 bits)
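The two field layouts can be inspected with Python's standard struct module. This is a minimal sketch. The function names fields_single and fields_double are assumptions made for illustration.

```python
import struct


def fields_single(x: float) -> tuple:
    """Split x, stored as IEEE 754 single precision, into (S, E, F)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fields_double(x: float) -> tuple:
    """Split x, stored as IEEE 754 double precision, into (S, E, F)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


print(fields_single(1.0))   # (0, 127, 0)
print(fields_double(-2.0))  # (1, 1024, 0)
```

E is returned as the raw stored field, not as the exponent value.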
Normalized Floating Point Numbers
For a normalized floating point number (S, E, F)
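Field layout: S | E | F, where F = $f_1 f_2 f_3 f_4 \ldots$
Significand is equal to $(1.F)_2 = (1.f_1 f_2 f_3 f_4 \ldots)_2$
IEEE 754 assumes a hidden leading 1 (not stored) for normalized numbers
Significand is 1 bit longer than fraction
Value of a Normalized Floating Point Number is
$(-1)^{S} \times (1.F)_2 \times 2^{\mathrm{val}(E)}$
$(-1)^{S} \times (1.f_1 f_2 f_3 f_4 \ldots)_2 \times 2^{\mathrm{val}(E)}$
$(-1)^{S} \times (1 + f_1 \times 2^{-1} + f_2 \times 2^{-2} + f_3 \times 2^{-3} + f_4 \times 2^{-4} + \ldots) \times 2^{\mathrm{val}(E)}$
$(-1)^{S}$ is 1 when S is 0 (positive), and -1 when S is 1 (negative)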
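The expanded sum can be written out directly. The following Python sketch uses hypothetical helper names (significand, normalized_value) and takes the fraction field as a string of 0s and 1s.

```python
def significand(fraction_bits: str) -> float:
    """Return (1.F)_2 for a fraction field given as a string of 0s and 1s."""
    value = 1.0                        # hidden leading 1, not stored
    for i, bit in enumerate(fraction_bits, start=1):
        value += int(bit) * 2.0**-i    # f_i has weight 2^-i
    return value


def normalized_value(sign: int, exponent_value: int, fraction_bits: str) -> float:
    """Evaluate (-1)^S * (1.F)_2 * 2^val(E)."""
    return (-1) ** sign * significand(fraction_bits) * 2.0**exponent_value


print(significand("101"))             # 1.625
print(normalized_value(1, -3, "01"))  # -0.15625
```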
Biased Exponent Representation
How to represent a signed exponent? Choices are …
Sign + magnitude representation for the exponent
Two’s complement representation
Biased representation
IEEE 754 uses biased representation for the exponent
Value of exponent = val(E) = E – Bias (Bias is a constant)
Recall that exponent field is 8 bits for single precision
E can be in the range 0 to 255
E = 0 and E = 255 are reserved for special use (discussed later)
E = 1 to 254 are used for normalized floating point numbers
Bias = 127 (half of 254), val(E) = E – 127
val(E=1) = –126, val(E=127) = 0, val(E=254) = 127
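A small Python sketch of the biased encoding. It assumes the general rule Bias = $2^{k-1} - 1$ for a k-bit exponent field, which gives 127 for k = 8. The function names are illustrative.

```python
def bias(exponent_bits: int) -> int:
    """Bias for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def exponent_value(E: int, exponent_bits: int = 8) -> int:
    """Return val(E) = E - Bias for a normalized number."""
    all_ones = 2**exponent_bits - 1
    if not 1 <= E <= all_ones - 1:
        raise ValueError("E = 0 and E = all ones are reserved")
    return E - bias(exponent_bits)


print(bias(8))              # 127
print(exponent_value(1))    # -126
print(exponent_value(127))  # 0
print(exponent_value(254))  # 127
```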
Biased Exponent – Cont’d
For double precision, exponent field is 11 bits
E can be in the range 0 to 2047
E = 0 and E = 2047 are reserved for special use
E = 1 to 2046 are used for normalized floating point numbers
Bias = 1023 (half of 2046), val(E) = E – 1023
val(E=1) = –1022, val(E=1023) = 0, val(E=2046) = 1023
Value of a Normalized Floating Point Number is
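$(-1)^{S} \times (1.F)_2 \times 2^{E - \mathrm{Bias}}$
$(-1)^{S} \times (1.f_1 f_2 f_3 f_4 \ldots)_2 \times 2^{E - \mathrm{Bias}}$
$(-1)^{S} \times (1 + f_1 \times 2^{-1} + f_2 \times 2^{-2} + f_3 \times 2^{-3} + f_4 \times 2^{-4} + \ldots) \times 2^{E - \mathrm{Bias}}$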
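The formula can be applied to a raw bit string of either width. This is a sketch under the assumption that the input is a normalized number. The name decode_normalized is hypothetical, and the reserved exponent values are rejected rather than interpreted.

```python
def decode_normalized(bits: str, exponent_bits: int) -> float:
    """Decode a normalized bit string laid out as sign | exponent | fraction."""
    sign = int(bits[0])
    E = int(bits[1:1 + exponent_bits], 2)
    fraction = bits[1 + exponent_bits:]
    bias = 2 ** (exponent_bits - 1) - 1
    if E == 0 or E == 2**exponent_bits - 1:
        raise ValueError("reserved exponent: not a normalized number")
    F = int(fraction, 2) / 2 ** len(fraction)   # fraction as a value in [0, 1)
    return (-1) ** sign * (1 + F) * 2.0 ** (E - bias)


# Single precision: E = 127, so the exponent value is 0; F = 0.5.
print(decode_normalized("0" + "01111111" + "1" + "0" * 22, 8))  # 1.5
# Double precision: E = 1024, so the exponent value is 1; F = 0.
print(decode_normalized("1" + "10000000000" + "0" * 52, 11))    # -2.0
```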
Examples of Single Precision Float
What is the decimal value of this Single Precision float?
10111110001000000000000000000000
Solution:
Sign = 1, so the number is negative
Exponent = $(01111100)_2 = 124$, E – bias = 124 – 127 = –3
Significand = $(1.0100 \ldots 0)_2 = 1 + 2^{-2} = 1.25$ (1. is implicit)
Value in decimal = $-1.25 \times 2^{-3} = -0.15625$
What is the decimal value of this Single Precision float?
01000001001001100000000000000000
Solution:
Value in decimal = $+(1.01001100 \ldots 0)_2 \times 2^{130 - 127}$ (1. is implicit) =
$(1.01001100 \ldots 0)_2 \times 2^{3} = (1010.01100 \ldots 0)_2 = 10.375$
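Both single-precision results can be checked by letting Python's struct module reinterpret the 32 bits. The helper name single_from_bits is an assumption for illustration.

```python
import struct


def single_from_bits(bits: str) -> float:
    """Interpret a 32-character bit string as an IEEE 754 single-precision float."""
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


print(single_from_bits("10111110001000000000000000000000"))  # -0.15625
print(single_from_bits("01000001001001100000000000000000"))  # 10.375
```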
Examples of Double Precision Float
What is the decimal value of this Double Precision float?
01000000010100101010000000000000
00000000000000000000000000000000
Solution:
Value of exponent = $(10000000101)_2$ – Bias = 1029 – 1023 = 6
Value of double float = $(1.00101010 \ldots 0)_2 \times 2^{6}$ (1. is implicit) =
$(1001010.10 \ldots 0)_2 = 74.5$
What is the decimal value of this Double Precision float?
10111111100010000000000000000000
00000000000000000000000000000000
Do it yourself! (answer should be $-1.5 \times 2^{-7} = -0.01171875$)
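To check a hand-computed answer for the double-precision examples, the 64 bits can be reinterpreted with the struct module. This is a sketch. The name double_from_bits is illustrative, and each bit pattern is written as two 32-bit halves, as on the slide.

```python
import struct


def double_from_bits(bits: str) -> float:
    """Interpret a 64-character bit string as an IEEE 754 double-precision float."""
    if len(bits) != 64:
        raise ValueError("expected exactly 64 bits")
    return struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]


first = "01000000010100101010000000000000" + "0" * 32
second = "10111111100010000000000000000000" + "0" * 32

print(double_from_bits(first))   # 74.5
print(double_from_bits(second))  # -0.01171875
```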
Converting FP Decimal to Binary
Convert –0.8125 to binary in single and double precision
Solution:
Fraction bits can be obtained using multiplication by 2
0.8125 × 2 = 1.625
0.625 × 2 = 1.25
0.25 × 2 = 0.5
0.5 × 2 = 1.0
Stop when fractional part is 0
Reading the integer parts from top to bottom: $0.8125 = (0.1101)_2 = \frac{1}{2} + \frac{1}{4} + \frac{1}{16} = \frac{13}{16}$
Fraction = $(0.1101)_2 = (1.101)_2 \times 2^{-1}$ (Normalized)
Sign = 1 (the number is negative)
Exponent = –1 + Bias = 126 (single precision) and 1022 (double)
Single Precision:
10111111010100000000000000000000
Double Precision:
10111111111010100000000000000000
00000000000000000000000000000000
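The multiply-by-2 procedure and the two final encodings can be reproduced in Python. This is a minimal sketch with illustrative names. The max_bits cutoff is an added safeguard for fractions whose binary expansion does not terminate.

```python
import struct


def fraction_to_binary(frac: float, max_bits: int = 32) -> str:
    """Binary digits of 0 <= frac < 1 by repeated multiplication by 2."""
    digits = []
    while frac != 0 and len(digits) < max_bits:
        frac *= 2
        bit = int(frac)          # the integer part is the next bit
        digits.append(str(bit))
        frac -= bit              # keep only the fractional part
    return "".join(digits)


def single_bits(x: float) -> str:
    """32-bit IEEE 754 single-precision pattern of x."""
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")


def double_bits(x: float) -> str:
    """64-bit IEEE 754 double-precision pattern of x."""
    return format(struct.unpack(">Q", struct.pack(">d", x))[0], "064b")


print(fraction_to_binary(0.8125))  # 1101
print(single_bits(-0.8125))        # 10111111010100000000000000000000
print(double_bits(-0.8125)[:32])   # 10111111111010100000000000000000
```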
Largest Normalized Float
What is the largest normalized float?
Solution for Single Precision:
01111111011111111111111111111111
Exponent – bias = 254 – 127 = 127 (largest exponent for SP)
Significand = $(1.111 \ldots 1)_2$ = almost 2
Value in decimal ≈ $2 \times 2^{127} \approx 2^{128} \approx 3.4028 \ldots \times 10^{38}$
Solution for Double Precision:
01111111111011111111111111111111
11111111111111111111111111111111
Value in decimal ≈ $2 \times 2^{1023} \approx 2^{1024} \approx 1.79769 \ldots \times 10^{308}$
Overflow: exponent is too large to fit in the exponent field
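The two largest values and the effect of overflow can be observed in Python, whose float type is a double on common platforms. This is a sketch. float_from_bits is a hypothetical helper.

```python
import math
import struct
import sys


def float_from_bits(bits: str) -> float:
    """Interpret a 32- or 64-character bit string as an IEEE 754 float."""
    fmt = {32: ">f", 64: ">d"}[len(bits)]
    return struct.unpack(fmt, int(bits, 2).to_bytes(len(bits) // 8, "big"))[0]


largest_single = float_from_bits("0" + "11111110" + "1" * 23)
largest_double = float_from_bits("0" + "11111111110" + "1" * 52)

print(f"{largest_single:.4e}")               # 3.4028e+38
print(largest_double == sys.float_info.max)  # True
# Doubling the largest double needs an exponent that does not fit the field.
print(largest_double * 2 == math.inf)        # True
```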
Smallest Normalized Float
What is the smallest (in absolute value) normalized float?
Solution for Single Precision:
00000000100000000000000000000000
Exponent – bias = 1 – 127 = –126 (smallest exponent for SP)
Significand = $(1.000 \ldots 0)_2 = 1$
Value in decimal = $1 \times 2^{-126} = 1.17549 \ldots \times 10^{-38}$
Solution for Double Precision:
00000000000100000000000000000000
00000000000000000000000000000000
Value in decimal = $1 \times 2^{-1022} = 2.22507 \ldots \times 10^{-308}$
Underflow: exponent is too small to fit in exponent field
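A matching Python sketch for the smallest normalized values. Variable names are illustrative. The last line shows a result far below the normalized range underflowing to zero.

```python
import struct
import sys

# Bit patterns from the slide: E = 1 and an all-zero fraction.
smallest_single = struct.unpack(
    ">f", int("0" + "00000001" + "0" * 23, 2).to_bytes(4, "big"))[0]
smallest_double = struct.unpack(
    ">d", int("0" + "00000000001" + "0" * 52, 2).to_bytes(8, "big"))[0]

print(smallest_single == 2.0**-126)           # True
print(f"{smallest_single:.5e}")               # 1.17549e-38
print(smallest_double == sys.float_info.min)  # True
# 2^-1022 * 2^-60 needs an exponent far too small for the field.
print(smallest_double * 2.0**-60)             # 0.0
```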