Elements of Floating-Point Arithmetic
August 2007
Chapter 1
Floating-point Numbers
1.1 Representations
On paper:

    ±d_1.d_2...d_t × β^e,   0 < d_1 < β,   0 ≤ d_i < β (i > 1)
    β: base (or radix),   e: exponent,   t: precision

In storage, three fields, usually consecutive:

    sign:      1 bit
    exponent:  store e
    fraction:  store d_1, ..., d_t
The base β is almost universally 2. In this case, d_1 is always 1, so it can be
left implicit, saving one bit; this is called the hidden bit. Other commonly used
bases are 10 and 16. Base 16 provides a wider range, but wastes bits in the
fraction. For example, in base 16, when 1.0_16 × 16^0 is stored in binary, there
are three leading zeros. In general, there can be up to log_2 β − 1 leading zero
bits.
A floating-point number system is characterized by four parameters:
• base β (also called radix)
• precision t
• exponent range e_min ≤ e ≤ e_max (two parameters, e_min and e_max)
Machine precision, ε_M, is defined as the distance between 1.0 and the next
larger floating-point number, that is, ε_M = β^{1−t}.
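For IEEE double precision (β = 2 and t = 53, counting the hidden bit), this definition can be checked directly; the following Python sketch compares β^{1−t} with the value reported by numpy:

    import numpy as np

    # Machine precision from the definition eps_M = beta^(1-t): for IEEE double,
    # beta = 2 and t = 53, so eps_M = 2^-52.
    eps_from_definition = 2.0 ** (1 - 53)
    print(eps_from_definition)              # 2.220446049250313e-16
    print(np.finfo(np.float64).eps)         # the same value, as reported by numpy
    print(np.nextafter(1.0, 2.0) - 1.0)     # distance from 1.0 to the next double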
Floating-point numbers are used to approximate real numbers; fl(x) denotes
the floating-point number that approximates a real number x.
One way of measuring the error in an approximation is related to the
absolute error. If rounding to nearest is applied and fl(x) = d_1.d_2···d_t × β^e,
then the absolute error |x − fl(x)| is no larger than (1/2) β^{−t+1} β^e, which is
half of the unit in the last place of fl(x). Thus we define one ulp (unit in the
last place) of a floating-point number d_1.d_2···d_t × β^e as β^{−t+1} β^e.
Another way of measuring the error in an approximation is related to the
relative error. Again, if rounding to nearest is applied, then the relative error
satisfies

    |x − fl(x)| / |fl(x)|  ≤  (1/2) β^{−t+1} β^e / |fl(x)|  ≤  (1/2) β^{−t+1},

since |fl(x)| ≥ β^e.
From this small system, we can see that the floating-point numbers in each
interval [β^e, β^{e+1}), e_min ≤ e ≤ e_max, are equally spaced, with distance
β^{−t+1} β^e = 1 ulp.
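In Python (3.9 or later), math.ulp gives this spacing directly; a small illustration of the constant spacing within one such interval in IEEE double precision:

    import math

    # For IEEE double (beta = 2, t = 53), one ulp of x = d_1.d_2...d_t x 2^e
    # is 2^(-52) * 2^e, constant throughout the interval [2^e, 2^(e+1)).
    print(math.ulp(1.0))      # 2^-52: spacing just above 1.0
    print(math.ulp(1.999))    # same spacing anywhere in [1, 2)
    print(math.ulp(2.0))      # 2^-51: the spacing doubles at 2.0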
[Bit layout of an IEEE single-precision number: sign (bit 0), exponent (bits 1-8), fraction (bits 9-31).]
Precision
Double precision is provided to support single precision, so that intermediate
results in evaluating an expression can be computed in higher precision. Why?
Extra-precise arithmetic attenuates the risk of error due to roundoff. For
example, in evaluating an inner product of two single-precision vectors, the
intermediate results can be computed in double precision to achieve higher
accuracy.
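A small illustration (the vectors and their length are made up for the example): a single-precision inner product is accumulated once in single and once in double precision, and both are compared with a double-precision reference.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000).astype(np.float32)
    y = rng.standard_normal(10_000).astype(np.float32)

    single_acc = np.float32(0.0)   # every addition rounded to single precision
    double_acc = np.float64(0.0)   # intermediate results kept in double precision
    for xi, yi in zip(x, y):
        single_acc += xi * yi
        double_acc += np.float64(xi) * np.float64(yi)

    # the double-precision accumulation is typically far more accurate
    reference = np.dot(x.astype(np.float64), y.astype(np.float64))
    print(abs(single_acc - reference), abs(double_acc - reference))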
Exponent
Since the exponent can be positive or negative, some method must be chosen
to represent its sign. Two common methods of representing signed integers
are sign/magnitude and two’s complement. The IEEE binary standard does
not use either of these methods to represent the exponent, but instead uses
a biased representation for efficient calculation of exponents. In the case
of single precision, where the exponent is stored in 8 bits, the bias is 127
(for double precision it is 1023). What this means is that if k is the value
of the exponent bits interpreted as an unsigned integer (biased), then the
exponent of the floating-point number is k − 127 (unbiased).
                              Single    Double
    Exponent width in bits         8        11
    Format width in bits          32        64
    Bias                         127      1023
For example, in single precision, if the binary integer stored in the exponent
field is 00000001, then the exponent of the floating-point number is
1 − 127 = −126 = e_min. Similarly, the exponent e_max = 127 is stored in the
exponent field as 11111110. What do 00000000 and 11111111 represent? This
will be discussed later.
Example. Analogously, in our small system we use bias 7. For example, the
binary format 1 0001 100 represents the binary floating-point number
−1.100 × 2^{−6}.
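A hypothetical decoder for this small format (the function name and structure are ours, not from the text) makes the bias-7 encoding concrete:

    # Small 8-bit format: 1 sign bit, 4 exponent bits with bias 7, 3 fraction
    # bits, and a hidden leading 1.  The special exponent patterns 0000 and 1111
    # (zeros/denormals and infinities/NaNs) are ignored in this sketch.
    def decode_small(bits):
        sign = -1.0 if bits[0] == "1" else 1.0
        e = int(bits[1:5], 2) - 7            # remove the bias
        fraction = int(bits[5:8], 2) / 8.0   # three fraction bits after the point
        return sign * (1.0 + fraction) * 2.0 ** e

    print(decode_small("10001100"))          # -1.100_2 x 2^-6 = -0.0234375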
Infinities
There are two infinities: ±∞. There are two cases where an infinity is
produced. When a normal nonzero floating-point number is divided by zero,
an infinity is produced. Another case is when overflow occurs. Infinities
provide a way to continue the computation in these two cases. The infinity
symbol obeys the usual mathematical conventions regarding infinity, such
as ∞ + ∞ = ∞, (−1) ∗ ∞ = −∞, and (finite)/∞ = 0.
where m = max(|x|, |y|) = x. Then we will get the computed sin θ = 1.000 ×
2^{−5} or cos θ = 1.000 × 2^0.
The infinity symbol is represented by a zero fraction and the same ex-
ponent field as a NaN (emax + 1); the sign bit distinguishes between ±∞.
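These conventions can be observed directly in any IEEE-conforming environment; for instance, in Python:

    import math

    inf = math.inf
    print(inf + inf)       # inf
    print((-1) * inf)      # -inf
    print(1.0 / inf)       # 0.0, since (finite)/infinity = 0
    print(1e308 * 10)      # inf: overflow also produces an infinity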
NaNs

Traditionally, the computation of 0/0 or √−1 has been treated as an
unrecoverable error which causes a computation to halt. However, it makes
sense for a computation to continue in such a situation. The IEEE standard
introduces NaN (Not a Number). Just as with the infinities, NaNs provide a
way to continue the computation in situations like 0/0 and √−1. Suppose that
the subroutine
zero(f,a,b) finds the zeros of a function f in the interval [a, b]. Normally, we
would not require the user to input the domain of f. If, in finding a zero, a
guess falls outside the domain of f and the function is evaluated at the guess,
then it makes sense to continue the computation. For example, if
f = (x^2 − 1)/(x − 1) and the function is evaluated at the guess x = 1, the
result is NaN; thus x = 1 is rejected and the subroutine continues to search for
a zero.
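A small sketch of this scenario (numpy is used so that 0/0 yields a NaN rather than raising a Python exception; the zero finder itself is not shown):

    import math
    import numpy as np

    def f(x):
        # f(x) = (x^2 - 1)/(x - 1); at x = 1 this is 0/0, which becomes NaN.
        with np.errstate(invalid="ignore"):
            return np.float64(x * x - 1.0) / np.float64(x - 1.0)

    guess = 1.0
    value = f(guess)
    print(value)               # nan
    print(math.isnan(value))   # True: a zero finder can reject this guess and continue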
Signed zeros
Zero is represented by the exponent e_min − 1 and a zero fraction. Since the
sign bit can take on two different values, there are two zeros: ±0. Naturally,
when comparing +0 and −0, we have +0 = −0.
Signed zeros can be useful. When x > 0 underflows to +0, 1/x gives +∞;
when x < 0 underflows to −0, 1/x gives −∞. They work like lim_{x→0+} 1/x = +∞
and lim_{x→0−} 1/x = −∞. Also, with signed zeros, we have x = 1/(1/x) for
x = ±∞; without signed zeros, this would cause confusion. Another example
involves log x. With signed zeros, we preserve the mathematical identities
lim_{x→0+} log x = −∞ and lim_{x→0−} log x = NaN.
The above infinities, NaNs, and signed zeros can be similarly introduced into
our small floating-point number system. For example, 1 1111 000 represents
−∞ and 1 1111 010 represents NaN.
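The behavior of signed zeros can likewise be checked in Python (numpy is used for the divisions so that they return infinities instead of raising):

    import numpy as np

    pzero, nzero = np.float64(0.0), np.float64(-0.0)
    print(pzero == nzero)                  # True: +0 and -0 compare equal
    with np.errstate(divide="ignore"):
        print(np.float64(1.0) / pzero)     # inf
        print(np.float64(1.0) / nzero)     # -inf
    print(np.copysign(1.0, nzero))         # -1.0: the sign bit is preserved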
Denormalized numbers
In our small system, 1.000 × 2^{e_min} is stored in the bit format:

    0 0001 000

The distance between this number and the next larger number 1.001 × 2^{e_min}
is 2^{−t+1} × 2^{e_min}. However, its distance to zero is 2^{e_min}. It makes sense
to fill in this gap with the numbers 0.001 × 2^{e_min}, ..., 0.111 × 2^{e_min}. These
numbers are called denormalized numbers. Figure 1.3.1 plots the positive
denormalized numbers in our small floating-point number system.
As shown above, without denormals, the spacing between two floating-point
numbers around β^{e_min} changes from β^{−t+1} β^{e_min} to β^{e_min}. With
denormals, adjacent spacings are either the same length or differ by a factor of β.
Thus, small numbers gradually underflow to zero.
Example 1.3.2. In our small floating-point system, if the normalized numbers
x = 1.001 × 2^{−6} and y = 1.000 × 2^{−6}, then x − y is too small to be
represented in the normalized range and must be flushed to zero (underflow),
although x ≠ y. When denormalized numbers are introduced, x − y does not
underflow; instead x − y = 0.001 × 2^{−6}. The use of denormalized numbers
guarantees

    x − y = 0  ⟺  x = y.
In the IEEE standard, the denormals are encoded by e_min − 1 in the exponent
field and a nonzero fraction. In this case, there is no hidden bit. For example,
using our small system, the bit format

    0 0000 001

represents the denormalized number 0.001 × 2^{e_min}.
Denormals are less accurate due to the leading zeros. Assume our small
system. The computed result of 1.011 × 2^{−4} ⊗ 1.010 × 2^{−4} is 0.011 × 2^{−6},
a denormal. The exact product is 1.10111 × 2^{−8}, and the relative error

    |1.10111 × 2^{−8} − 0.011 × 2^{−6}| / (1.10111 × 2^{−8}) > 2^{−3},

which is larger than the roundoff unit u = 2^{−4}.
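Gradual underflow can be seen in IEEE double precision as well; a minimal sketch:

    import numpy as np

    tiny = np.finfo(np.float64).tiny     # smallest positive normal number, 2^-1022
    x = 1.25 * tiny
    y = 1.00 * tiny
    print(x - y)                         # 0.25 * 2^-1022, a denormal rather than zero
    print(x - y == 0.0, x == y)          # False False: x - y = 0 still implies x = y
    print(np.nextafter(0.0, 1.0))        # smallest positive denormal, 2^-1074 (about 5e-324)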
Table 1.1 summarizes IEEE values.
Chapter 2

Floating-point Arithmetic
The following shows two algorithms for computing the 2-norm of a vector
x = [x_i], i = 1, ..., n.
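As a sketch of two such algorithms (these exact formulations are an assumption, following the standard naive and scaled approaches rather than a verbatim copy):

    import math

    def norm2_naive(x):
        # sqrt(sum of squares): the squares can overflow or underflow even
        # when the norm itself is representable.
        return math.sqrt(sum(xi * xi for xi in x))

    def norm2_scaled(x):
        # scale by m = max |x_i| first, so every square lies in [0, 1]
        m = max(abs(xi) for xi in x)
        if m == 0.0:
            return 0.0
        return m * math.sqrt(sum((xi / m) ** 2 for xi in x))

    x = [3e200, 4e200]
    print(norm2_naive(x))    # inf: the squares overflow
    print(norm2_scaled(x))   # about 5e+200, the correct 2-norm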
• conversions of formats
The above procedure is iterated until the correction is sufficiently small.
Iterative refinement improves the solution only when the residual is calculated
in high precision.
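A minimal sketch of the usual scheme (the matrix, the function name, and the single/double precision split are illustrative assumptions): the system is solved in working (here single) precision, the residual r = b − Ax̂ is formed in higher (double) precision, and a correction is computed and added.

    import numpy as np

    def refine(A, b, x, steps=3):
        A32 = A.astype(np.float32)
        for _ in range(steps):
            r = b - A @ x                                    # residual in double precision
            d = np.linalg.solve(A32, r.astype(np.float32))   # correction in single precision
            x = x + d.astype(np.float64)                     # update the solution
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x0 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
    print(np.linalg.norm(A @ x0 - b))                   # residual of the single-precision solution
    print(np.linalg.norm(A @ refine(A, b, x0) - b))     # smaller after refinement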
How can you tell whether the system, including the processor and the
compiler, on your machine supports extra precision? Using our small number
system as an example to illustrate the idea, you can write the expression:
    Round          → +∞           → −∞           → 0            → nearest
    1.1001 × 2^2   1.101 × 2^2    1.100 × 2^2    1.100 × 2^2    1.100 × 2^2
Chapter 3

Error Analysis
If we use

    1 + x + x^2/2! + x^3/3! + ··· + x^n/n!

to approximate

    e^x = 1 + x + x^2/2! + x^3/3! + ··· + x^n/n! + ···,

then the truncation error is

    x^{n+1}/(n+1)! + x^{n+2}/(n+2)! + ···.
From the Taylor expansion

    f(x + h) = f(x) + h f'(x) + (h^2/2!) f''(ξ),   for some ξ ∈ [x, x + h],

we can use the following approximation:

    y_h(x) = (f(x + h) − f(x)) / h ≈ f'(x).

The discretization error is E_dis = (h/2) |f''(ξ)|.
Table 3.1: the computed y_h(1) and its error y_h(1) − e.

    h           y_h(1)               error
    10^{-1}     2.85884195487388      1.40560126415e-1
    10^{-2}     2.73191865578708      1.36368273280e-2
    10^{-3}     2.71964142253278      1.35959407373e-3
    10^{-4}     2.71841774707848      1.35918619434e-4
    10^{-5}     2.71829541991231      1.35914532646e-5
    10^{-6}     2.71828318698653      1.35852748429e-6
    10^{-7}     2.71828196396484      1.35505794585e-7
    10^{-8}     2.71828177744737     -5.10116753283e-8
    10^{-9}     2.71828159981169     -2.28647355716e-7
    10^{-10}    2.71827893527643     -2.89318261570e-6
    10^{-11}    2.71827005349223     -1.17749668154e-5
    10^{-12}    2.71827005349223     -1.17749668154e-5
    10^{-13}    2.71338507218388     -4.89675627517e-3
    10^{-14}    2.66453525910038     -5.37465693587e-2
    10^{-15}    2.66453525910038     -5.37465693587e-2
Note that both the truncation error and the discretization error have nothing
to do with computation. If the arithmetic is perfect (no rounding errors), the
discretization error decreases as h decreases. In practice, however, rounding
errors are unavoidable. Consider the above example and let f(x) = e^x. We
computed y_h(1) on a SUN Sparc V in MATLAB 5.20.
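The experiment is easy to repeat; a short Python version follows (the exact digits will differ slightly from Table 3.1 across platforms, but the error decreases and then increases in the same way):

    import math

    f = math.exp
    x = 1.0
    for k in range(1, 16):
        h = 10.0 ** (-k)
        yh = (f(x + h) - f(x)) / h          # forward-difference approximation of f'(1) = e
        print(f"h = 1e-{k:02d}   yh(1) = {yh:.14f}   error = {yh - math.e: .3e}")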
Table 3.1 shows that as h decreases the error first decreases and then
increases. This is because of the combination of discretization error and
rounding errors. In this example, the discretization error is
    E_dis = (h/2) |f''(ξ)| ≤ (h/2) e^{1+h} ≈ (h/2) e    for small h.
Now we consider the rounding errors. Let the computed y_h(x) be

    ŷ_h(x) = [(e^{(x+h)(1+δ_0)} (1 + δ_1) − e^x (1 + δ_2)) (1 + δ_3) / h] (1 + δ_4)
            ≈ [e^{x+h} (1 + δ_0 + δ_1 + δ_3 + δ_4) − e^x (1 + δ_2 + δ_3 + δ_4)] / h,

for |δ_i| ≤ u (i = 0, 1, 2, 3, 4). In the above derivation, we assume that the δ_i
are small, so that terms like δ_i δ_j and higher order can be ignored. We also
assume that e^{(x+h)(1+δ_0)} ≈ e^{x+h}(1 + δ_0); since x = 1 here and h is small,
(x + h)δ_0 is of essentially the same size as δ_0.
[Figure: the total error in the computed y_h(1), on a vertical scale of 10^{−5}, plotted against h for h between 10^{−10} and 10^{−5}.]
    E_round = |y_h(x) − ŷ_h(x)| ≈ |ξ_1 e^{x+h} − ξ_2 e^x| / h,   for |ξ_1| ≤ 4u and |ξ_2| ≤ 3u.

When x = 1, we have

    E_round ≈ (7u/h) e.
So the rounding error increases as h decreases. Combining both errors, we get
the total error

    E_total = E_dis + E_round ≈ (h/2 + 7u/h) e.

The total error is minimized when h/2 = 7u/h, that is, when h = √(14u),
which is about 4 × 10^{−8} in double precision; this is consistent with the
behavior shown in Table 3.1.
Suppose that ŷ is a computed approximation to the exact result y = f(x);
then the absolute error |y − ŷ| and the relative error |y − ŷ|/|y| are called
forward errors. Alternatively, we can ask: “For what set of data have we
solved our problem?” That is, the computed result ŷ is the exact result for the
input x + ∆x, i.e., ŷ = f(x + ∆x). In general, there may be many such ∆x, so
we are interested in the smallest such ∆x and a bound for |∆x|. This bound,
possibly divided by |x|, is called the backward error.
For example, consider the problem of computing the square root y = √x. The
IEEE standard requires that the computed square root satisfy

    ŷ = fl(√x) = √x (1 + δ),   |δ| ≤ u.

Then the relative error, or the forward error, is |δ|, which is bounded by u.
What is the backward error? Set

    ŷ = √x (1 + δ) = √(x + ∆x),

so that ∆x = x(2δ + δ^2) ≈ 2δx, and the backward error |∆x|/|x| ≈ 2|δ| ≤ 2u.
As another example, consider computing the sum

    s_n = x_1 ⊕ x_2 ⊕ ··· ⊕ x_n

by recursive summation. Then

    s_2 = x_1 ⊕ x_2 = x_1 (1 + ε_1) + x_2 (1 + ε_1)
    s_3 = s_2 ⊕ x_3 = (s_2 + x_3)(1 + ε_2)
        = x_1 (1 + ε_1)(1 + ε_2) + x_2 (1 + ε_1)(1 + ε_2) + x_3 (1 + ε_2)
    ...
    s_n = x_1 (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
        + x_2 (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
        + x_3 (1 + ε_2) ··· (1 + ε_{n−1})
        + ···
        + x_{n−1} (1 + ε_{n−2})(1 + ε_{n−1})
        + x_n (1 + ε_{n−1})
Define

    1 + η_1 = (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
    1 + η_2 = (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
    1 + η_3 = (1 + ε_2) ··· (1 + ε_{n−1})
    ···
    1 + η_{n−1} = (1 + ε_{n−2})(1 + ε_{n−1})
    1 + η_n = 1 + ε_{n−1}

Then

    |η_n| = |ε_{n−1}| ≤ u,
    1 + η_{n−1} = 1 + (ε_{n−2} + ε_{n−1}) + ε_{n−2} ε_{n−1},   so   |η_{n−1}| ≈ |ε_{n−2} + ε_{n−1}| ≤ 2u.

In general,

    |η_1| ≤ (n − 1)u,
    |η_i| ≤ (n − i + 1)u,   i = 2, 3, ..., n.
Accounting for the higher-order terms rigorously, one can show that

    (1 + ε_1)(1 + ε_2) ··· (1 + ε_n) = 1 + η,   where |η| ≤ 1.06 n u,

provided n u is sufficiently small. Thus

    x_1 ⊕ ··· ⊕ x_n = x_1 (1 + η_1) + ··· + x_n (1 + η_n),
    |η_1| ≤ 1.06 (n − 1) u,
    |η_i| ≤ 1.06 (n − i + 1) u,   i = 2, 3, ..., n.
This gives a much smaller backward error. A formal proof, much longer and
trickier than the algorithm, can be found in [2, Page 615] or [1]. Here is an
intuitive explanation [1]. In the step t = s + y of updating the partial sum s,
we have

      s
    + y_h  y_l
    ---------
      t

where y is partitioned into its high-order bits y_h and low-order bits y_l
according to s. Next, t − s gives an approximation of y_h. Thus, c = (t − s) − y
is an approximation of −y_l, which is used as a correction incorporated into
the next summand x_i.
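This is the compensated summation scheme usually attributed to Kahan; a short Python sketch of it, compared against plain recursive summation (the test data are made up):

    import math

    def kahan_sum(xs):
        s = 0.0
        c = 0.0                 # running compensation: approximately -y_l
        for x in xs:
            y = x - c           # fold the previous correction into the next summand
            t = s + y           # the low-order bits of y are lost in this addition
            c = (t - s) - y     # (t - s) recovers y_h, so c approximates -y_l
            s = t
        return s

    xs = [0.1] * 10_000_000
    print(sum(xs))              # plain recursive summation drifts noticeably
    print(kahan_sum(xs))        # compensated summation stays much closer to the true sum
    print(math.fsum(xs))        # Python's correctly rounded reference sum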
The process of bounding the backward error is called backward error
analysis. The motivation is to interpret rounding errors as perturbations in
the data. Consequently, it reduces the question of estimating the forward
error to perturbation theory. We will see its significance in the following
sections.
To illustrate the forward and backward errors, let us consider the computation
of x̂ − ŷ, where x̂ and ŷ can be previously computed results. Assume that x
and y are exact results and x̂ = x(1 + δ_x) and ŷ = y(1 + δ_y). Then

    x̂ ⊖ ŷ = (x̂ − ŷ)(1 + δ),   |δ| ≤ u.

It then follows that x̂ ⊖ ŷ = x(1 + δ_x)(1 + δ) − y(1 + δ_y)(1 + δ). Ignoring the
second-order terms δ_x δ and δ_y δ and letting δ_1 = δ_x + δ and δ_2 = δ_y + δ, we
get

    x̂ ⊖ ŷ = x(1 + δ_1) − y(1 + δ_2).
If |δ_x| and |δ_y| are small, then |δ_1|, |δ_2| ≤ u + max(|δ_x|, |δ_y|) are also small;
i.e., the backward errors are small. However, the forward error (relative error)
is

    E_rel = |(x̂ ⊖ ŷ) − (x − y)| / |x − y| = |x δ_1 − y δ_2| / |x − y|.

If δ_1 ≠ δ_2, i.e., δ_x ≠ δ_y, it is possible that E_rel is large when |x − y| is small,
i.e., when x and y are close to each other. This is called catastrophic
cancellation. If δ_x = δ_y, and in particular if both x and y are original data
(δ_x = δ_y = 0), then δ_1 = δ_2 and E_rel = |δ_1|, which is small; the cancellation
is benign.
Example 3.2.2. Consider two algorithms for computing x^2 − y^2: the one on
the left computes x ⊗ x ⊖ y ⊗ y, the one on the right computes
(x ⊕ y) ⊗ (x ⊖ y). A backward error analysis gives

    x ⊗ x ⊖ y ⊗ y                                       (x ⊕ y) ⊗ (x ⊖ y)
    = (x^2 (1 + δ_1) − y^2 (1 + δ_2))(1 + δ_3)          = (x + y)(1 + δ_1) (x − y)(1 + δ_2)(1 + δ_3)
    ≈ (x(1 + (δ_1+δ_3)/2))^2 − (y(1 + (δ_2+δ_3)/2))^2   = (x^2 − y^2)(1 + δ_1 + δ_2 + δ_3)
    = (x(1 + δ_x))^2 − (y(1 + δ_y))^2                   = (x(1 + δ_x))^2 − (y(1 + δ_y))^2
    |δ_i| ≤ u,  |δ_x|, |δ_y| ≤ u                        |δ_i| ≤ u,  |δ_x| = |δ_y| ≤ 3u/2
The above backward analysis shows that both algorithms have small backward
errors. However, the following forward error analysis shows that the algorithm
on the left can produce a large error when |x| and |y| are close, due to
catastrophic cancellation; the algorithm on the right guarantees an accurate
result by avoiding catastrophic cancellation.
    x ⊗ x ⊖ y ⊗ y                                       (x ⊕ y) ⊗ (x ⊖ y)
    = (x^2 − y^2)(1 + δ_3) + x^2 δ_1 − y^2 δ_2          = (x^2 − y^2)(1 + δ_1 + δ_2 + δ_3)
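A single-precision illustration (the particular values of x and y are made up): both formulas are evaluated in float32 and compared with a double-precision reference.

    import numpy as np

    x = np.float32(1.0002)
    y = np.float32(1.0001)
    left = x * x - y * y           # x*x and y*y are rounded before the cancelling subtraction
    right = (x + y) * (x - y)      # x - y is formed from the original data, so the
                                   # cancellation does not amplify earlier rounding errors
    exact = float(x) ** 2 - float(y) ** 2     # double-precision reference
    print(abs(left - exact) / abs(exact))     # large relative error: catastrophic cancellation
    print(abs(right - exact) / abs(exact))    # small relative error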
    (A + ∆A) x̂ = b.

The smallest ∆A is

    ∆A = [ 0     0
           0  −3.2 ],

which is of the same size as A. This means that Gaussian elimination (without
pivoting) is unstable. Note that the exact solution is

    x = [ −3.2···
          1.0032··· ].
The magnification factor 2|x^2 + y^2| / |x^2 − y^2| is called the condition number
of the problem. When |x| is close to |y|, the condition number is large; that is,
the result x^2 − y^2 can be very sensitive to perturbations in x and y.
Note that the condition number gives an upper bound on the error
magnification for any method. Example 3.2.2 shows that the algorithm on the
left can achieve this upper bound, while the well-designed algorithm on the
right produces an accurate result even when the condition number is large,
that is, when |x| is close to |y|.
We now revisit the problem of solving a system of linear equations

    Ax = b,

where A and b are the known data and x is the result. The question is: how
sensitive is x to changes in A and/or b? We can assume that the change is
only in b, since a change in A can be transformed into a change in b. Let x̆ be
the solution of the perturbed system

    A x̆ = b + ∆b.
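A small numerical sketch of this question (the matrix and perturbation are made-up illustrations): the relative change in the solution is compared with the condition number of A times the relative change in b.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])      # a mildly ill-conditioned matrix
    b = np.array([2.0, 2.0001])
    db = np.array([0.0, 1e-6])         # a small perturbation of b

    x = np.linalg.solve(A, b)
    x_pert = np.linalg.solve(A, b + db)

    rel_change_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
    rel_change_b = np.linalg.norm(db) / np.linalg.norm(b)
    print(rel_change_x)                        # much larger than rel_change_b
    print(np.linalg.cond(A) * rel_change_b)    # the bound kappa(A) * ||db|| / ||b||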
Example 3.4.1. Consider a metal bar with one end fixed to a wall and the
other end carrying a mass. To find out how the bar bends, we need to solve a
symmetric pentadiagonal linear system whose coefficient matrix, for N = 8,
has the form

     9  −4   1
    −4   6  −4   1
     1  −4   6  −4   1
         1  −4   6  −4   1
             1  −4   6  −4   1
                 1  −4   6  −4   1
                     1  −4   5  −2
                         1  −2   1
The condition number of the matrix grows on the order of N^4, as shown in
the following table.
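A sketch that builds this matrix and computes its condition number with numpy; extending the same five-diagonal pattern and boundary rows to other values of N is our assumption here, used only to observe the growth.

    import numpy as np

    def beam_matrix(n):
        # pentadiagonal stencil [1, -4, 6, -4, 1] with the boundary rows
        # adjusted as in the N = 8 matrix shown above
        A = np.zeros((n, n))
        for i in range(n):
            for j, v in zip(range(i - 2, i + 3), (1.0, -4.0, 6.0, -4.0, 1.0)):
                if 0 <= j < n:
                    A[i, j] = v
        A[0, 0] = 9.0
        A[-2, -2], A[-2, -1] = 5.0, -2.0
        A[-1, -2], A[-1, -1] = -2.0, 1.0
        return A

    for n in (8, 16, 32, 64):
        print(n, np.linalg.cond(beam_matrix(n)))   # the text states the growth is of order N^4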
Note that the small change in the coefficient −210 has caused ten of the zeros
to become complex, and two of them have moved more than 2.81 units off the
real axis. This means the zeros of p(x) are very sensitive to changes in the
coefficients. The results were computed using very accurate (high-precision)
arithmetic, so rounding errors have no noticeable effect, nor is the algorithm
used to compute the zeros at fault; the difficulty is the sensitivity of the
problem itself.
[1] David Goldberg. What every computer scientist should know about
floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.