
Elements of Floating-point Arithmetic

Sanzheng Qiao Yimin Wei

August 2007
Chapter 1

Floating-point Numbers

1.1 Representations
On paper: ±d1.d2 · · · dt × β^e, where 0 < d1 < β and 0 ≤ di < β for i > 1. Here β is the base (or radix), e is the exponent, and t is the precision.

In storage: three fields, usually consecutive: a sign bit (1 bit), an exponent field storing e, and a fraction field storing d1, ..., dt.
The base β is almost universally 2. In this case, d1 is always 1, so it can be left implicit, saving one bit; this is called the hidden bit. Other commonly used bases are 10 and 16. Base 16 provides a wider range, but wastes bits in the fraction. For example, when the base-16 number 1.0 × 16^0 is stored in binary, there are three leading zeros in the significand. In general, there can be up to log2 β − 1 leading zeros.
A floating-point number system is characterized by four parameters:
• base β (also called radix)
• precision t
• exponent range emin ≤ e ≤ emax
Machine precision, ǫM, is defined as the distance between 1.0 and the next larger floating-point number, that is, ǫM = β^{1−t}.
Floating-point numbers are used to approximate real numbers; fl(x) denotes the floating-point number that approximates a real number x.
One way of measuring the error in an approximation is related to the absolute error. If the nearest rounding is applied and fl(x) = d1.d2 · · · dt × β^e, then the absolute error |x − fl(x)| is no larger than (1/2) β^{−t+1} β^e, which is half of the unit in the last place of fl(x). Thus we define one ulp (unit in the last place) of a floating-point number d1.d2 · · · dt × β^e as β^{−t+1} β^e.
Another way of measuring the error in an approximation is related to the relative error. Again, if the nearest rounding is applied, then the relative error

    |x − fl(x)| / |fl(x)| ≤ ((1/2) β^{−t+1} β^e) / |fl(x)| ≤ (1/2) β^{−t+1};

this bound is defined as the unit of roundoff, denoted by u. When β = 2, u = 2^{−t}.


To illustrate the difference between ulp and u, we consider the real number 31/64. When β = 2 and t = 4, it is approximated by 1.000 × 2^{−1}. The relative error is bounded above by 2^{−4} = u, while the absolute error is 2^{−6} = 1/4 ulp, instead of 1/2 ulp, of 1.000 × 2^{−1}. The reason for this wobbling by a factor of β is that when 31/64 = 1.1111 × 2^{−2} is rounded to 1.000 × 2^{−1}, the exponent is increased by one.
Usually, we are only concerned about the magnitude of the rounding error. Then, ulps and u can be used interchangeably. If the error in a floating-point number is n ulps or nu, then the number of contaminated digits is log_β n. Also, since ǫM = 2u, the machine precision ǫM can also be used in place of u or ulps.

1.2 A Small System


To simplify our presentation, we will often use a contrived small floating-
point number system characterized by

    β    t    emin    emax    Exponent width    Format width
    2    4    −6      +7      4 bits            8 bits

From this small system, we can see that the floating-point numbers in each interval [β^e, β^{e+1}), emin ≤ e ≤ emax, are equally spaced, with distance β^{−t+1} β^e = 1 ulp.

Example  The decimal number 0.1 cannot be represented exactly in the small system; it is approximated by its nearest floating-point number 1.101 × 2^{−4}. The relative error is no larger than u = 2^{−4} and the absolute error is no larger than 2^{−8} = 1/2 ulp.
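
The same issue arises in IEEE double precision. As a small aside (a minimal Python sketch, not part of the original text), the exact rational value stored for 0.1 can be inspected with the standard fractions module:

    from fractions import Fraction

    # The double nearest to 0.1 is a rational number with denominator 2**55.
    stored = Fraction(0.1)            # exact value of the stored double
    exact = Fraction(1, 10)

    print(stored)                     # 3602879701896397/36028797018963968
    print(float(stored - exact))      # representation error, about 5.5e-18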

1.3 IEEE Floating-point Standard


IEEE 754 is a binary standard that requires base β = 2 and

                     single precision    double precision
    precision t             24                  53
    emin                  −126               −1022
    emax                   127                1023

Single precision word format: the sign occupies bit 0, the exponent bits 1–8, and the fraction bits 9–31:

    sign | exponent | fraction
     0     1 ... 8    9 ... 31

Precision
Double precision subsumes single precision, so that intermediate results in evaluating a single-precision expression can be computed in higher precision. Why? Extra-precise arithmetic attenuates the risk of error due to roundoff. For example, in evaluating an inner product of two single precision vectors, the intermediate results can be computed in double precision to achieve high accuracy.

Exponent
Since the exponent can be positive or negative, some method must be chosen
to represent its sign. Two common methods of representing signed integers
are sign/magnitude and two’s complement. The IEEE binary standard does
not use either of these methods to represent the exponent, but instead uses
a biased representation for efficient calculation of exponents. In the case
of single precision, where the exponent is stored in 8 bits, the bias is 127
(for double precision it is 1023). What this means is that if k is the value
of the exponent bits interpreted as an unsigned integer (biased), then the
exponent of the floating-point number is k − 127 (unbiased).

                              Single    Double
    Exponent width in bits        8        11
    Format width in bits         32        64
    Bias                        127      1023

For example, in single precision, if the binary integer stored in the exponent field is 00000001, then the exponent of the floating-point number is −126 = emin. Similarly, the exponent emax = 127 is stored in the exponent field as 11111110. What do 00000000 and 11111111 represent? This will be discussed later.
Example  Analogously, in our small system, we use bias 7. For example, the bit format 1 0001 100 represents the binary floating-point number −1.100 × 2^{−6}.
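
The decoding rule is easy to express in code. Below is a minimal Python sketch for the small system (bias 7, 4-bit exponent, 3-bit fraction); the function name decode_small is made up here, and the sketch handles normalized numbers only:

    def decode_small(bits):
        # bits is a string "s eeee fff": sign, biased exponent, fraction
        bits = bits.replace(" ", "")
        sign = -1.0 if bits[0] == "1" else 1.0
        e = int(bits[1:5], 2) - 7           # remove the bias of 7
        frac = int(bits[5:8], 2) / 8.0      # digits d2 d3 d4 after the hidden bit
        return sign * (1.0 + frac) * 2.0 ** e

    print(decode_small("1 0001 100"))       # -1.100_2 x 2^-6 = -0.0234375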

1.3.1 Special quantities


The IEEE standard specifies the following special values: ±∞, NaNs, ±0,
and denormalized numbers. These special values are all encoded with expo-
nents of either emax + 1 or emin − 1.

Infinities
There are two infinities: ±∞. There are two cases where an infinity is
produced. When a normal nonzero floating-point number is divided by zero,
an infinity is produced. Another case is when overflow occurs. Infinities
provide a way to continue the computation in these two cases. The infinity
symbol obeys the usual mathematical conventions regarding infinity, such
as ∞ + ∞ = ∞, (−1) ∗ ∞ = −∞, and (finite)/∞ = 0.

Example 1.3.1  Consider two floating-point numbers x = 1.000 × 2^5 and y = 1.000 × 2^0 in our small system. Suppose that we want to compute

    sin θ = y / √(x^2 + y^2)   or   cos θ = x / √(x^2 + y^2).

With infinities, if we compute sin θ, x^2 overflows to +∞, the computation continues, and the computed sin θ = 0. Without infinities, the computation may be interrupted, or x^2 is set to the largest number 1.111 × 2^7 and we will get sin θ = 1.000 × 2^{−4}. As we know, when |y|/|x| is small, sin θ ≈ y/x = 1.000 × 2^{−5}. However, if we compute cos θ using the above formula, with infinities we will get cos θ = x/(+∞) = 0. Without infinities, if x^2 is set to the largest number in this case, we will get cos θ = 2. Both are very wrong. A better formula is to replace the denominator √(x^2 + y^2) with m √(1 + (y/m)^2), where m = max(|x|, |y|) = x. Then we will get the computed sin θ = 1.000 × 2^{−5} or cos θ = 1.000 × 2^0.
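
The same scaling trick matters in IEEE double precision when |x| is near the overflow threshold. The following Python sketch is only an illustration; the function names and the values 1e200 and 1e195 are chosen here so that x^2 overflows in double precision:

    import math

    def sin_cos_naive(x, y):
        r = math.sqrt(x * x + y * y)        # x*x overflows to inf
        return y / r, x / r

    def sin_cos_scaled(x, y):
        m = max(abs(x), abs(y))             # scale by the larger magnitude
        r = m * math.sqrt((x / m) ** 2 + (y / m) ** 2)
        return y / r, x / r

    x, y = 1.0e200, 1.0e195
    print(sin_cos_naive(x, y))              # (0.0, 0.0): both results ruined by overflow
    print(sin_cos_scaled(x, y))             # approximately (1e-05, 1.0)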

The infinity symbol is represented by a zero fraction and the same ex-
ponent field as a NaN (emax + 1); the sign bit distinguishes between ±∞.

NaNs
Traditionally, the computation of 0/0 or √−1 has been treated as an unrecoverable error which causes a computation to halt. However, it makes sense for a computation to continue in such a situation. The IEEE standard provides NaN (Not a Number). Just as with infinities, NaNs provide a way to continue the computation in situations like 0/0 and √−1. Suppose that the subroutine zero(f,a,b) finds the zeros of a function f in the interval [a, b]. Normally, we would not require the user to input the domain of f. If, in finding a zero, a guess falls outside the domain of f and the function is evaluated at the guess, it makes sense to continue the computation. For example, if f(x) = (x^2 − 1)/(x − 1) and the function is evaluated at the guess x = 1, the result is NaN; thus x = 1 is rejected and the subroutine continues to find a zero.

Operations that produce a NaN

    Operation    NaN produced by
    +            ∞ + (−∞)
    *            0 * ∞
    /            0/0, ∞/∞
    REM          x REM 0, ∞ REM y
    sqrt         sqrt(x) when x < 0

In IEEE 754, NaNs are represented as floating-point numbers with ex-


ponent emax + 1 and nonzero fractions. Thus there is not a unique NaN, but
rather a whole family of NaNs. In general, whenever a NaN participates in
a floating-point operation, the result is another NaN.

Signed zeros
Zero is represented by the exponent emin − 1 and a zero fraction. Since the sign bit can take on two different values, there are two zeros: ±0. Naturally, when comparing +0 and −0, +0 = −0.
Signed zeros can be useful. When x > 0 underflows to +0, 1/x gives +∞; when x < 0 underflows to −0, 1/x gives −∞. They work like lim_{x→0+} 1/x = +∞ and lim_{x→0−} 1/x = −∞. Also, with signed zeros, we have x = 1/(1/x) for x = ±∞. Without signed zeros, this identity would fail for x = −∞. Another example involves log x. With signed zeros, we preserve the mathematical identities lim_{x→0+} log x = −∞ and lim_{x→0−} log x = NaN.
The above infinities, NaNs, and signed zeros can be similarly introduced into our small floating-point number system. For example, 1 1111 000 represents −∞ and 1 1111 010 represents a NaN.

Figure 1.3.1: Denormalized numbers (between 0 and 2^{−6}, marked by ×).

Denormalized numbers
In our small system, 1.000 × 2^{emin} is stored in the bit format
    0 0001 000
The distance between this number and the next larger number 1.001 × 2^{emin} is 2^{−t+1} × 2^{emin}. However, its distance to zero is 2^{emin}. It makes sense to fill in this gap with the numbers 0.001 × 2^{emin}, ..., 0.111 × 2^{emin}. These numbers are called denormalized numbers. Figure 1.3.1 plots the positive denormalized numbers in our small floating-point number system.
As shown above, without denormals, the spacing between two adjacent floating-point numbers around β^{emin} jumps from β^{−t+1} β^{emin} to β^{emin}. With denormals, adjacent spacings are either the same length or differ by a factor of β. Thus, small numbers gradually underflow to zero.
Example 1.3.2  In our small floating-point system, if the normalized numbers x = 1.001 × 2^{−6} and y = 1.000 × 2^{−6}, then x − y is too small to be represented in the normalized number range and must be flushed to zero (underflow), although x ≠ y. When the denormalized numbers are introduced, x − y does not underflow; instead x − y = 0.001 × 2^{−6}. The use of denormalized numbers guarantees

    x − y = 0  ⟺  x = y.
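
Gradual underflow can be observed directly in IEEE double precision, whose smallest positive normalized number is 2^{−1022} and smallest positive denormal is 2^{−1074}; a small Python sketch:

    import sys

    tiny = sys.float_info.min        # smallest positive normalized double, 2**-1022
    x = tiny * (1.0 + 2.0 ** -52)    # the next double above tiny
    y = tiny

    print(x != y)                    # True
    print(x - y)                     # 5e-324, the smallest denormal; not flushed to zero
    print(x - y == 0.0)              # False, so x - y = 0 still implies x = y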
In the IEEE standard, the denormals are encoded by emin − 1 in the exponent and a nonzero fraction. In this case, there is no hidden bit. For example, using our small system, the bit format
    0 0000 001
represents the denormalized number 0.001 × 2^{emin}.
Denormals are less accurate due to the leading zeros. Assume our small system. The computed result of 1.011 × 2^{−4} ⊗ 1.010 × 2^{−4} is 0.011 × 2^{−6}, a denormal. The exact product is 1.10111 × 2^{−8} and the relative error

    |1.10111 × 2^{−8} − 0.011 × 2^{−6}| / (1.10111 × 2^{−8}) > 2^{−3},

which is larger than the roundoff unit u = 2^{−4}.
Table 1.1 summarizes IEEE values.

    Exponent            Fraction    Represents
    e = emin − 1        f = 0       ±0
    e = emin − 1        f ≠ 0       0.f × 2^{emin}
    emin ≤ e ≤ emax     any f       1.f × 2^e
    e = emax + 1        f = 0       ∞
    e = emax + 1        f ≠ 0       NaN

Table 1.1: IEEE 754 values


Chapter 2

Floating-point Arithmetic

2.1 Overflow and Underflow

Overflow means that the exponent is too large to be represented in the exponent field. Underflow means that the exponent is too small to be represented in the exponent field. In our small floating-point system, the largest normal number Nmax is 1.111 × 2^7, which is called the overflow threshold. When a result is larger than Nmax, overflow occurs. The smallest positive normal number Nmin is 1.000 × 2^{−6}, which is called the underflow threshold. With the denormalized numbers, the smallest positive number is β^{−t+1} β^{emin} = 0.001 × 2^{−6}. When a nonzero result is smaller in magnitude than Nmin, underflow occurs.

2.1.1 Avoiding unnecessary over/underflow

Scaling is a commonly used technique to avoid unnecessary underflow and


overflow.

The following shows two algorithms for computing the 2-norm of a vector
x = [xi ], i = 1, ..., n.


Without scaling:

    ssq = 0.0;
    for i=1 to n
        ssq = ssq + x(i)*x(i);
    end
    nrm2 = sqrt(ssq);

With scaling:

    scale = 0.0;
    ssq = 1.0;
    for i=1 to n
        if (x(i) <> 0.0)
            if (scale < abs(x(i)))
                tmp = scale/x(i);
                ssq = 1.0 + ssq*tmp*tmp;
                scale = abs(x(i));
            else
                tmp = x(i)/scale;
                ssq = ssq + tmp*tmp;
            end
        end
    end
    nrm2 = scale*sqrt(ssq);
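
As a runnable illustration, here is a Python translation of the scaled algorithm (a sketch assuming plain Python floats, i.e. IEEE doubles; the function names are made up), together with a vector whose squares overflow:

    import math

    def nrm2_unscaled(x):
        return math.sqrt(sum(xi * xi for xi in x))     # xi*xi can overflow

    def nrm2_scaled(x):
        scale, ssq = 0.0, 1.0
        for xi in x:
            if xi != 0.0:
                axi = abs(xi)
                if scale < axi:
                    ssq = 1.0 + ssq * (scale / axi) ** 2
                    scale = axi
                else:
                    ssq += (axi / scale) ** 2
        return scale * math.sqrt(ssq)

    x = [1.0e200, 2.0e200, 2.0e200]        # exact 2-norm is 3.0e200
    print(nrm2_unscaled(x))                # inf: the squares overflow
    print(nrm2_scaled(x))                  # 3e+200

Dividing by the running maximum keeps every accumulated quantity between 1 and n, so unnecessary overflow is avoided and any underflow of the small ratios is harmless.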

2.2 Correctly Rounded Operations


When we apply a floating-point operation to floating-point numbers, the
exact result may not fit the format of the floating-point system. We must
round the exact result to fit the format of our system.
The IEEE standard requires that the following floating-point operations
are correctly rounded:

• arithmetic operations +, −, ∗, and /

• square root and remainder

• conversions of formats

The format conversions include:

• between floating-point formats, e.g., from single to double.

• between floating-point and integer formats.

• from floating-point to integer value (not format).

• between binary and decimal.

Correctly rounded means that the result must be the same as if it were computed exactly and then rounded, usually to the nearest floating-point number. For example, if ⊕ denotes the floating-point addition, then given two floating-point numbers a and b,

    a ⊕ b = fl(a + b).

For example, suppose that in our small floating-point system a = 1.110 × 2^0 and b = 1.111 × 2^{−1}; then a + b = 1.01011 × 2^1, which does not fit the format of our system. We require that a ⊕ b = fl(a + b) = 1.011 × 2^1.
From the above requirement, a⊕b = b⊕a = fl(a+b). However, (a⊕b)⊕c
is not necessarily equal to a ⊕ (b ⊕ c) and a ⊗ (b ⊕ c) is not necessarily equal
to a ⊗ b ⊕ a ⊗ c.
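
Both failures are easy to observe in IEEE double precision; the particular values in this Python sketch are just convenient illustrations:

    # Associativity of floating-point addition fails:
    a, b, c = 1.0e16, -1.0e16, 0.5
    print((a + b) + c)       # 0.5
    print(a + (b + c))       # 0.0: adding 0.5 to -1e16 first has no effect

    # Distributivity fails as well (b is 2**53):
    a, b, c = 3.0, 9007199254740992.0, 1.0
    print(a * (b + c) - (a * b + a * c))   # -4.0: the two expressions differ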
Note that the correctly rounded operations do not include transcendental functions. Thus, for example, of the following two expressions for x^3:
    x**3        x*x*x
the expression on the right, which uses only correctly rounded multiplications, is safer than the one on the left.

2.3 Extra Precision


Most processors, including Intel and Motorola processors, support extra precision, called long double by the IEEE standard. On those processors, intermediate results are accumulated in 10-byte registers (15-bit exponent and 64-bit significand, no hidden bit) before the final result is stored with a 53-bit significand (one hidden bit) in memory. Extra precision produces more accurate results.
One application of extra precision is in solving linear systems Ax = b.
Suppose that the accuracy of the computed solution x is unsatisfactory. A
technique called iterative refinement can be used to improve the accuracy.
Here is the algorithm:

1. Compute the residual r = Ax − b;


2. Solve for the correction d in Ad = r;
3. Update the solution x = x − d.

The above procedure is iterated until the correction is sufficiently small. It-
erative refinement improves the solution only when the residual is calculated
in high precision.
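
A sketch of the idea in Python with NumPy is shown below. Here the working precision is single (float32) and the residual is accumulated in double (float64); the function name refine, the random test matrix, and the number of refinement steps are illustrative assumptions, not part of the text:

    import numpy as np

    def refine(A, b, steps=3):
        """Iterative refinement: solve in float32, compute residuals in float64."""
        x = np.linalg.solve(A, b)                          # initial single-precision solution
        A64, b64 = A.astype(np.float64), b.astype(np.float64)
        for _ in range(steps):
            r = A64 @ x.astype(np.float64) - b64           # residual in higher precision
            d = np.linalg.solve(A, r.astype(np.float32))   # correction in working precision
            x = x - d
        return x

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 50)).astype(np.float32)
    b = rng.standard_normal(50).astype(np.float32)
    x_ref = np.linalg.solve(A.astype(np.float64), b.astype(np.float64))

    x0 = np.linalg.solve(A, b)
    x1 = refine(A, b)
    print(np.linalg.norm(x0 - x_ref) / np.linalg.norm(x_ref))   # error of the plain solve
    print(np.linalg.norm(x1 - x_ref) / np.linalg.norm(x_ref))   # usually noticeably smaller

In practice the LU factorization from the first solve would be reused for the correction steps; the sketch refactorizes only to keep the code short.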
How can you tell whether the system, including the processor and the
compiler, on your machine supports extra precision? Using our small number
system as an example to illustrate the idea, you can write the expression:

x = (1.0 + 1.25e-1)*(1.0 - 1.25e-1) - 1.0;


If the result is x = −2^{−6} = −0.015625, then the system supports extra precision. If x = 0.0, then it does not. Why? If the system supports extra precision, say it accumulates intermediate results in 6-bit registers, then it accumulates the intermediate result (1 + 2^{−3})(1 − 2^{−3}) = 1 − 2^{−6} in a 6-bit register in full accuracy. After subtracting one from it, we get −2^{−6}. Without extra precision, the intermediate result 1 − 2^{−6} would be rounded to a 4-bit significand, resulting in 1.0, and thus the final result is 0.

2.4 Rounding Modes


In the IEEE standard, rounding occurs whenever an operation has a result that is not exact. By default, rounding means round to nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward −∞. The four rounding modes are:
1. Round to +∞: always round up to the next representable floating-point number.
2. Round to −∞: always round down to the next representable floating-point number.
3. Round to 0: always truncate the digits after the last representable digit.
4. Round to nearest even: always round to the nearest floating-point number. In the case of a tie, the one with its least significant bit equal to zero is chosen.

Example  Consider two floating-point numbers a = b = 1.010 × 2^1 in our small floating-point number system. The exact value of a * b is 1.1001 × 2^2. This value is between the two floating-point numbers 1.100 × 2^2 and 1.101 × 2^2. Table 2.1 lists the rounded values under different rounding modes.
Why dynamic directed rounding modes? Testing numerical sensitivity.
How? Different rounding introduces slightly different rounding errors. If
slightly perturbed intermediate results cause significant changes in the final
results, the program is sensitive to the perturbation.
Another application of dynamic directed rounding modes is interval
arithmetic.

    Round          → +∞          → −∞          → 0           → nearest
    1.1001 × 2^2   1.101 × 2^2   1.100 × 2^2   1.100 × 2^2   1.100 × 2^2

Table 2.1: Four rounding modes
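
Directed rounding for binary floats cannot be switched from plain Python, but the decimal module provides the same four modes for decimal arithmetic, which is enough to reproduce Table 2.1. The sketch below rounds the exact product 6.25 to the small system's spacing of 0.5 near that value; using decimal here is an illustrative substitute for the binary formats:

    from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN, ROUND_HALF_EVEN

    exact = Decimal("6.25")     # 1.1001_2 x 2^2; its neighbours in the small system are 6.0 and 6.5
    ulp = Decimal("0.5")        # spacing of the small system in [4, 8)

    for mode in (ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN, ROUND_HALF_EVEN):
        rounded = (exact / ulp).quantize(Decimal("1"), rounding=mode) * ulp
        print(mode, rounded)    # 6.5, 6.0, 6.0, 6.0, matching Table 2.1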


Chapter 3

Error Analysis

3.1 Sources of Errors


When an infinite series is approximated by a finite sum, truncation error is introduced. For example, if we use

    1 + x + x^2/2! + x^3/3! + · · · + x^n/n!

to approximate

    e^x = 1 + x + x^2/2! + x^3/3! + · · · + x^n/n! + · · ·,

then the truncation error is

    x^{n+1}/(n + 1)! + x^{n+2}/(n + 2)! + · · ·.

When a continuous problem is approximated by a discrete one, discretization error is introduced. For example, from the expansion

    f(x + h) = f(x) + h f′(x) + (h^2/2!) f″(ξ),   for some ξ ∈ [x, x + h],

we can use the following approximation:

    yh(x) = (f(x + h) − f(x))/h ≈ f′(x).

The discretization error is Edis = |f″(ξ)| h/2.

    h          yh(1)               error
    10^{−1}    2.85884195487388    1.40560126415e−1
    10^{−2}    2.73191865578708    1.36368273280e−2
    10^{−3}    2.71964142253278    1.35959407373e−3
    10^{−4}    2.71841774707848    1.35918619434e−4
    10^{−5}    2.71829541991231    1.35914532646e−5
    10^{−6}    2.71828318698653    1.35852748429e−6
    10^{−7}    2.71828196396484    1.35505794585e−7
    10^{−8}    2.71828177744737    −5.10116753283e−8
    10^{−9}    2.71828159981169    −2.28647355716e−7
    10^{−10}   2.71827893527643    −2.89318261570e−6
    10^{−11}   2.71827005349223    −1.17749668154e−5
    10^{−12}   2.71827005349223    −1.17749668154e−5
    10^{−13}   2.71338507218388    −4.89675627517e−3
    10^{−14}   2.66453525910038    −5.37465693587e−2
    10^{−15}   2.66453525910038    −5.37465693587e−2

Table 3.1: Values of yh(1) and errors using various sizes of h

Note that both truncation error and discretization error have nothing to do with computation. If the arithmetic were perfect (no rounding errors), the discretization error would decrease as h decreases. In practice, however, rounding errors are unavoidable. Consider the above example and let f(x) = e^x. We computed yh(1) on a SUN Sparc V in MATLAB 5.20.
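
The experiment is easy to repeat in any IEEE double precision environment. The Python sketch below reproduces the computation; the last few digits depend on the particular exp implementation and may differ from Table 3.1:

    import math

    def yh(x, h):
        return (math.exp(x + h) - math.exp(x)) / h     # forward-difference approximation

    for k in range(1, 16):
        h = 10.0 ** -k
        approx = yh(1.0, h)
        print(f"h = 1e-{k:02d}   yh(1) = {approx:.14f}   error = {approx - math.e:.11e}")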
Table 3.1 shows that as h decreases the error first decreases and then increases. This is because of the combination of discretization error and rounding errors. In this example, the discretization error is

    Edis = (h/2) |f″(ξ)| ≤ (h/2) e^{1+h} ≈ (h/2) e   for small h.
Now, we consider the rounding errors. Let the computed yh(x) be

    ŷh(x) = ((e^{(x+h)(1+δ0)} (1 + δ1) − e^x (1 + δ2))(1 + δ3) / h) (1 + δ4)
          ≈ (e^{x+h} (1 + δ0 + δ1 + δ3 + δ4) − e^x (1 + δ2 + δ3 + δ4)) / h,

for |δi| ≤ u (i = 0, 1, 2, 3, 4). In the above derivation, we assume that the δi are small, so that we ignore terms like δi δj or of higher order.
Figure 3.1.1: Total error Etotal versus h.
We also assume that e^x is computed accurately, that is, the computed e^x equals e^x (1 + δ) where |δ| ≤ u. Thus we have the rounding error

    Eround = |yh(x) − ŷh(x)| ≈ (ξ1 e^{x+h} − ξ2 e^x)/h   for |ξ1| ≤ 4u and |ξ2| ≤ 3u.

When x = 1, we have

    Eround ≈ (7u/h) e.

So the rounding error increases as h decreases. Combining both errors, we get the total error:

    Etotal = Edis + Eround ≈ (h/2 + 7u/h) e.

Figure 3.1.1 plots Etotal.

To minimize the total error, we differentiate Etotal with respect to h, set the derivative to zero, and get the optimal h:

    hopt = √(14u) ≈ √u.

For IEEE double precision, u ≈ 1.1 × 10^{−16}, so hopt ≈ 10^{−8}, which is consistent with the minimum error in Table 3.1.

3.2 Forward and Backward Errors


Suppose a program takes an input x and computes y. We can view the output y as a function of the input x, y = f(x). Denote the computed result as ŷ; then the absolute error |y − ŷ| and the relative error |y − ŷ|/|y| are called forward errors. Alternatively, we can ask: “For what set of data have we solved our problem?” That is, the computed result ŷ is the exact result for the input x + ∆x, i.e., ŷ = f(x + ∆x). In general, there may be many such ∆x, so we are interested in the minimal such ∆x and a bound for |∆x|. This bound, possibly divided by |x|, is called the backward error.

For example, consider the problem of computing the square root y = x.
The IEEE standard requires that the computed square root
√ √
ŷ = fl( x) = x (1 + δ), |δ| ≤ u.

Then the relative error or the forward error is |δ|, which is bounded by u.
What is the backward error? Set
√ √
ŷ = x (1 + δ) = x + ∆x,

then ∆x = 2xδ + xδ2 . Thus, ignoring δ2 , we have the backward error:

|∆x|/|x| ≈ 2|δ| ≤ 2u.

Example 3.2.1  Computing the sum

    sn = x1 ⊕ x2 ⊕ · · · ⊕ xn.

Denote the partial sum si = x1 ⊕ x2 ⊕ · · · ⊕ xi; then

    s2 = x1 ⊕ x2 = x1 (1 + ǫ1) + x2 (1 + ǫ1)
    s3 = s2 ⊕ x3 = (s2 + x3)(1 + ǫ2)
       = x1 (1 + ǫ1)(1 + ǫ2) + x2 (1 + ǫ1)(1 + ǫ2) + x3 (1 + ǫ2)
    ...
    sn = x1 (1 + ǫ1)(1 + ǫ2) · · · (1 + ǫn−1)
       + x2 (1 + ǫ1)(1 + ǫ2) · · · (1 + ǫn−1)
       + x3 (1 + ǫ2) · · · (1 + ǫn−1)
       + · · ·
       + xn−1 (1 + ǫn−2)(1 + ǫn−1)
       + xn (1 + ǫn−1)

Define

    1 + η1 = (1 + ǫ1)(1 + ǫ2) · · · (1 + ǫn−1)
    1 + η2 = (1 + ǫ1)(1 + ǫ2) · · · (1 + ǫn−1)
    1 + η3 = (1 + ǫ2) · · · (1 + ǫn−1)
    · · ·
    1 + ηn−1 = (1 + ǫn−2)(1 + ǫn−1)
    1 + ηn = (1 + ǫn−1)

Then

    |ηn| = |ǫn−1| ≤ u
    1 + ηn−1 = 1 + (ǫn−2 + ǫn−1) + ǫn−2 ǫn−1
    |ηn−1| ≈ |ǫn−2 + ǫn−1| ≤ 2u

In general,

    |η1| ≤ (n − 1)u
    |ηi| ≤ (n − i + 1)u,   i = 2, 3, ..., n

A more rigorous bound: if nu ≤ 0.1 and |ǫi| ≤ u (i = 1, 2, ..., n), then

    (1 + ǫ1)(1 + ǫ2) · · · (1 + ǫn) = 1 + η,

where

    |η| ≤ 1.06nu.

Thus

    x1 ⊕ · · · ⊕ xn = x1 (1 + η1) + · · · + xn (1 + ηn),
    |η1| ≤ 1.06(n − 1)u,
    |ηi| ≤ 1.06(n − i + 1)u,   i = 2, 3, ..., n.

How bad is the accumulated error? Using double precision, u ≈ 10^{−15}; for nu ≥ 10^{−4}, we need n ≥ 10^{11}.
There is an accurate and efficient summation formula due to Kahan. Suppose the sum x1 + x2 + · · · + xn is computed using the following algorithm:

s = x(1); % partial sum


c = 0.0; % correction
for i=2:n
y = x(i) - c; % correction from previous iteration
t = s + y; % update partial sum
c = (t - s) - y; % update correction
s = t;
end
Then the computed sum s is equal to

    x1 (1 + η1) + · · · + xn (1 + ηn) + O(nu^2)(|x1| + · · · + |xn|),   where |ηi| ≤ 2u.

This gives a much smaller backward error. A formal proof, much longer and trickier than the algorithm, can be found in [2, Page 615] or [1]. Here is an intuitive explanation [1]. In the step t = s + y of updating the partial sum s, the addend y is effectively split by its alignment with s into high-order bits yh, which are absorbed into t, and low-order bits yl, which are lost:

    t ≈ s + yh,   y = yh + yl.

Next, t − s gives an approximation of yh. Thus, c = (t − s) − y is an approximation of −yl, which is used as a correction incorporated into the next summand xi.
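
A quick comparison in double precision (a hedged sketch; the error magnitudes quoted in the comments are typical, not exact):

    import math

    def kahan_sum(xs):
        s, c = 0.0, 0.0                  # partial sum and running correction
        for x in xs:
            y = x - c                    # apply the correction from the previous step
            t = s + y
            c = (t - s) - y              # the part of y lost when forming t
            s = t
        return s

    xs = [0.1] * 1_000_000

    print(abs(sum(xs) - 100000.0))       # recursive summation: error around 1e-6
    print(abs(kahan_sum(xs) - 100000.0)) # compensated summation: error below 1e-9
    print(abs(math.fsum(xs) - 100000.0)) # exactly rounded sum, for reference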
The process of bounding the backward error is called backward error
analysis. The motivation is to interpret rounding errors as perturbations in
the data. Consequently, it reduces the question of estimating the forward
error to perturbation theory. We will see its significance in the following
sections.
To illustrate the forward and backward errors, let us consider the compu-
tation of x̂ − ŷ , where x̂ and ŷ can be previously computed results. Assume
that x and y are exact results and x̂ = x(1 + δx ) and ŷ = y(1 + δy ), then

x̂ ⊖ ŷ = fl(x̂ − ŷ) = (x̂ − ŷ)(1 + δ), |δ| ≤ u.

It then follows that x̂ ⊖ ŷ = x(1 + δx )(1 + δ) − y(1 + δy )(1 + δ). Ignoring the
second order terms δx δ and δy δ and letting δ1 = δx + δ and δ2 = δy + δ, we
get
x̂ ⊖ ŷ = x(1 + δ1 ) − y(1 + δ2 ).
If |δx| and |δy| are small, then |δ1|, |δ2| ≤ u + max(|δx|, |δy|) are also small, i.e., the backward errors are small. However, the forward error (relative error) is

    Erel = |(x̂ ⊖ ŷ) − (x − y)| / |x − y| = |x δ1 − y δ2| / |x − y|.

If δ1 ≠ δ2, i.e., δx ≠ δy, it is possible that Erel is large when |x − y| is small, i.e., when x and y are close to each other. This is called catastrophic cancellation. If δx = δy, and in particular if both x and y are original data (δx = δy = 0), then Erel = |δ| ≤ u. This is called benign cancellation. The following example illustrates the difference between the two types of cancellation.

Example 3.2.2  Two algorithms for computing z = x^2 − y^2.

The first algorithm computes x ⊗ x ⊖ y ⊗ y:

    x ⊗ x ⊖ y ⊗ y = (x^2 (1 + δ1) − y^2 (1 + δ2))(1 + δ3)
                  ≈ (x(1 + (δ1 + δ3)/2))^2 − (y(1 + (δ2 + δ3)/2))^2
                  = (x(1 + δx))^2 − (y(1 + δy))^2,
    |δi| ≤ u,   |δx|, |δy| ≤ u.

The second algorithm computes (x ⊕ y) ⊗ (x ⊖ y):

    (x ⊕ y) ⊗ (x ⊖ y) = (x + y)(1 + δ1)(x − y)(1 + δ2)(1 + δ3)
                      ≈ (x^2 − y^2)(1 + δ1 + δ2 + δ3)
                      = (x(1 + δx))^2 − (y(1 + δy))^2,
    |δi| ≤ u,   |δx| = |δy| ≤ 3u/2.

The above backward analysis shows that both algorithms have small backward errors. However, the following forward error analysis shows that the first algorithm can produce a large error when |x| and |y| are close, due to catastrophic cancellation, while the second algorithm guarantees an accurate result by avoiding catastrophic cancellation. For the first algorithm,

    x ⊗ x ⊖ y ⊗ y = (x^2 − y^2)(1 + δ3) + x^2 δ1 − y^2 δ2,

so

    |ẑ − z|/|z| ≤ |δ3| + |x^2 δ1 − y^2 δ2|/|z| ≤ u(1 + (x^2 + y^2)/|z|).

For the second algorithm,

    (x ⊕ y) ⊗ (x ⊖ y) = (x^2 − y^2)(1 + δ1 + δ2 + δ3),

so

    |ẑ − z|/|z| = |δ1 + δ2 + δ3| ≤ 3u.
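
The difference is easy to see in single precision arithmetic. In the NumPy sketch below (illustrative values, not from the text), x^2 rounds in float32 while the factored form happens to be exact:

    import numpy as np

    x, y = np.float32(10001.0), np.float32(10000.0)
    z_exact = 10001.0 * 10001.0 - 10000.0 * 10000.0    # 20001.0, exact in double precision

    z_left = x * x - y * y                             # x*x rounds to 100020000 in float32
    z_right = (x + y) * (x - y)                        # every step here is exact in float32

    print(abs(float(z_left) - z_exact) / z_exact)      # about 5e-05: catastrophic cancellation
    print(abs(float(z_right) - z_exact) / z_exact)     # 0.0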

3.3 Stability of an Algorithm


A method for computing y = f(x) is called backward stable if, for any x, it produces a computed ŷ with a small backward error, that is, ŷ = f(x + ∆x) for some small ∆x. Usually there exist many such ∆x, and we are interested in the smallest one. If even the smallest ∆x turns out to be large, then the algorithm is unstable.

Example 3.3.1  Suppose β = 10 and t = 3. Consider the following system Ax = b, where

    A = [ .001  1.00 ]        b = [  1.00 ]
        [ 1.00  .200 ],           [ −3.00 ].

Applying Gaussian elimination (without pivoting), we get the computed decomposition

    L̂ Û = [ 1.00    0   ] [ .001   1.00  ]
          [ 1000  1.00  ] [   0   −1000  ].

Let the computed solution

    x̂ = [   0  ]
        [ 1.00 ]

be the exact solution of the perturbed system

    (A + ∆A) x̂ = b.

Solving for ∆A, we get

    ∆A = [ ×    0  ]
         [ ×  −3.2 ].

The smallest ∆A is

    ∆A = [ 0    0  ]
         [ 0  −3.2 ],

which is of the same size as A. This means that Gaussian elimination (without pivoting) is unstable. Note that the exact solution is

    x = [ −3.2 · · ·   ]
        [ 1.0032 · · · ].
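
The same instability appears in double precision whenever the pivot is tiny. The sketch below (an illustrative 2 × 2 system, not the text's β = 10, t = 3 example; the function name is made up) compares elimination without pivoting against NumPy's solver, which uses partial pivoting:

    import numpy as np

    def solve_no_pivot(A, b):
        """Gaussian elimination without pivoting for a 2 x 2 system (illustration only)."""
        m = A[1, 0] / A[0, 0]              # huge multiplier when the pivot A[0,0] is tiny
        u22 = A[1, 1] - m * A[0, 1]
        y2 = b[1] - m * b[0]
        x2 = y2 / u22
        x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
        return np.array([x1, x2])

    A = np.array([[1.0e-20, 1.0],
                  [1.0,     1.0]])
    b = np.array([1.0, 2.0])               # the exact solution is very close to (1, 1)

    print(solve_no_pivot(A, b))            # roughly [0. 1.]: x1 is completely wrong
    print(np.linalg.solve(A, b))           # with partial pivoting: approximately [1. 1.]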

3.4 Sensitivity of a Problem


Backward analysis processes the rounding error so that the computed result
becomes the exact solution of a perturbed problem. How sensitive is the
solution to the perturbations in the data of a problem?
Let us first revisit the problem of evaluating x^2 − y^2. Suppose that the data x and y are perturbed by relative errors ǫx and ǫy respectively. We try to find out how much the relative errors in the data can be magnified in the relative error of the result:

    |(x(1 + ǫx))^2 − (y(1 + ǫy))^2 − (x^2 − y^2)| / |x^2 − y^2|
        ≈ |2x^2 ǫx − 2y^2 ǫy| / |x^2 − y^2|
        ≤ (2|x^2 + y^2| / |x^2 − y^2|) ǫ,   ǫ = max(|ǫx|, |ǫy|).

The magnification factor 2|x^2 + y^2|/|x^2 − y^2| is called the condition number of the problem. When |x| is close to |y| the condition number is large, that is, the result x^2 − y^2 can be very sensitive to perturbations in x and y.

Note that the condition number gives an upper bound on the error magnification for any method. Example 3.2.2 shows that the first algorithm (x ⊗ x ⊖ y ⊗ y) can attain this upper bound, while the well-designed second algorithm ((x ⊕ y) ⊗ (x ⊖ y)) produces an accurate result even when the condition number is large, that is, when |x| is close to |y|.
Then we revisit the problem of solving a system of linear equations

    Ax = b,

where A and b are known and x is the result. The question is: how sensitive is x to changes in A and/or b? We can assume that the change is only in b, since a change in A can be transformed into a change in b. Let x̆ be the solution of the perturbed system

    A x̆ = b + ∆b.

The change in x (relative error) is ‖x̆ − x‖/‖x‖ and the change in b is ‖∆b‖/‖b‖. We use the ratio of the two errors as the condition number:

    cond = (‖x̆ − x‖/‖x‖) / (‖∆b‖/‖b‖) = (‖A^{−1} ∆b‖/‖x‖) · (‖Ax‖/‖∆b‖) ≤ ‖A^{−1}‖ ‖A‖.

So ‖A^{−1}‖ ‖A‖ is the condition number of the problem of solving a linear system.
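
For example (a NumPy sketch using the Hilbert matrix, a standard ill-conditioned test matrix that is not part of the text), a perturbation of b at the level of 10^{−10} changes the computed solution by far more than 10^{−10}:

    import numpy as np

    n = 10
    A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])  # Hilbert matrix
    x_true = np.ones(n)
    b = A @ x_true

    db = 1e-10 * np.random.default_rng(0).standard_normal(n)    # tiny perturbation of b
    x_pert = np.linalg.solve(A, b + db)

    print(np.linalg.cond(A))                                    # about 1.6e13 for n = 10
    print(np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true))  # many orders above 1e-10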

Example 3.4.1  Consider a metal bar with one end fixed to a wall and the other end carrying a mass. To find out how the bar bends, we need to solve a symmetric pentadiagonal linear system whose coefficient matrix has the following form when N = 8:

    [  9  −4   1                                 ]
    [ −4   6  −4   1                             ]
    [  1  −4   6  −4   1                         ]
    [      1  −4   6  −4   1                     ]
    [          1  −4   6  −4   1                 ]
    [              1  −4   6  −4   1             ]
    [                  1  −4   5  −2             ]
    [                      1  −2   1             ]

The condition number of the matrix grows in the order of N^4, as shown in the following table.

    N       10        100       1000      10000
    cond    1.4e+04   1.3e+08   1.3e+12   1.3e+16

To reduce the discretization error, we need a large N, usually in the thousands. However, when N is large, the system is ill-conditioned, and thus the computed solution is severely contaminated by rounding errors. Iterative refinement can attenuate the risk.

In general, we can view a problem with data x and result y as a function y = f(x). The result of the perturbed problem is y̆ = f(x + ∆x). The sensitivity is measured by

    cond = (|y̆ − y|/|y|) / (|∆x|/|x|) = (|f(x + ∆x) − f(x)|/|∆x|) · (|x|/|f(x)|) ≈ |f′(x)| · |x|/|f(x)|.   (3.4.1)

Note that the conditioning of a problem is independent of rounding errors and of the algorithms for solving the problem. The following example is due to Wilkinson (see [3]).

Example 3.4.2  Let p(x) = (x − 1)(x − 2) · · · (x − 19)(x − 20) = x^{20} − 210x^{19} + · · ·. The zeros of p(x) are 1, 2, ..., 19, 20 and are well separated. With a floating-point number system of β = 2, t = 30, when we enter a typical coefficient into the computer, it is necessary to round it to 30 significant base-2 digits. If we make a change in the 30th significant base-2 digit of only one of the twenty coefficients, the coefficient of x^{19}, it is changed from −210 to −210 + 2^{−23}. Let us see how much effect this small change has on the zeros of the polynomial. Here we list (using β = 2, t = 90) the roots of the equation p(x) + 2^{−23} x^{19} = 0, correctly rounded to the number of digits shown:

1.00000 0000 10.09526 6145 ± 0.64350 0904i


2.00000 0000 11.79363 3881 ± 1.65232 9728i
3.00000 0000 13.99235 8137 ± 2.51883 0070i
4.00000 0000 16.73073 7466 ± 2.81262 4894i
4.99999 9928 19.50243 9400 ± 1.94033 0347i
6.00000 6944
6.99969 7234
8.00726 7603
8.91725 0249
20.84690 8101

Note that the small change in the coefficient −210 has caused ten of the zeros to become complex, and that two have moved more than 2.81 units off the real axis. That means the zeros of p(x) are very sensitive to changes in the coefficients. The results above were computed with very high accuracy, so they are not an artifact of rounding errors, nor of any deficiency in the algorithm used to find the roots; the problem is the sensitivity of the zeros themselves.
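
The experiment can be repeated approximately with NumPy (a hedged sketch: double precision cannot even store all the coefficients exactly, and np.roots adds rounding errors of its own, so the numbers will not reproduce the table, but the qualitative behaviour is the same):

    import numpy as np

    coeffs = np.poly(np.arange(1, 21, dtype=float))   # coefficients of (x-1)(x-2)...(x-20)
    perturbed = coeffs.copy()
    perturbed[1] += 2.0 ** -23                        # coeffs[1] is the coefficient of x^19, i.e. -210

    new_roots = np.roots(perturbed)
    print(np.max(np.abs(new_roots.imag)))             # typically of order 1: many zeros are now complex
    print(np.max(new_roots.real))                     # the largest zero is noticeably larger than 20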

As discussed before, backward error analysis transforms rounding errors


into perturbations of data. Thus we can establish a relation between forward
and backward errors and the conditioning of the problem. Clearly, (3.4.1)
shows that
Eforward ≤ cond · Ebackward .
This inequality tells us that large forward errors can be caused by ill-conditioning of the problem, by an unstable algorithm, or by both. The significance of backward error analysis is that it allows us to determine whether an algorithm is stable (has small backward errors). If we can prove the algorithm is stable, in other words, that the backward errors are small, say no larger than the measurement errors in the data, then we know that large forward errors are due to the ill-conditioning of the problem. On the other hand, if we know the problem is well-conditioned, then large forward errors must be caused by an unstable algorithm.
As discussed in Section 2.4, dynamic directed rounding can be used to
test a program’s sensitivity to perturbations.
Bibliography

[1] David Goldberg. What every computer scientist should know about
floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.

[2] Donald E. Knuth. The Art of Computer Programming, volume II. Addison-Wesley, Reading, Mass., 2nd edition, 1997.

[3] J.H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice Hall,


Englewood Cliffs, N.J., 1963.
