Elements of Floating-Point Arithmetic
August 2007
Chapter 1
Floating-point Numbers
1.1 Representations
On paper:

    ±d_1.d_2...d_t × β^e,   0 < d_1 < β,   0 ≤ d_i < β (i > 1)
    β: base (or radix),   e: exponent,   t: precision

In storage, three fields, usually consecutive:

    sign:      1 bit
    exponent:  store e
    fraction:  store d_1, ..., d_t
The base β is almost universally 2. In this case, d_1 is always 1, so it can be
left implicit, saving one bit; this is called the hidden bit. Other commonly used
bases are 10 and 16. Base 16 provides a wider range, but wastes bits in the
fraction. For example, in base 16, when 1.0_16 × 16^0 is stored in binary, there
are three leading zeros. In general, there can be up to log_2 β − 1 leading zero
bits.
A floating-point number system is characterized by four parameters:
• base β (also called radix)
• precision t
• exponent range e_min ≤ e ≤ e_max (two parameters, e_min and e_max)
Machine precision, ε_M, is defined as the distance between 1.0 and the next
larger floating-point number, that is, ε_M = β^{1−t}.
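For IEEE double precision (β = 2 and t = 53, counting the hidden bit), this definition can be checked directly; the following Python sketch compares β^{1−t} with the value reported by numpy:

    import numpy as np

    # Machine precision from the definition eps_M = beta^(1-t): for IEEE double,
    # beta = 2 and t = 53, so eps_M = 2^-52.
    eps_from_definition = 2.0 ** (1 - 53)
    print(eps_from_definition)              # 2.220446049250313e-16
    print(np.finfo(np.float64).eps)         # the same value, as reported by numpy
    print(np.nextafter(1.0, 2.0) - 1.0)     # distance from 1.0 to the next double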
Floating-point numbers are used to approximate real numbers; fl(x) denotes
the floating-point number that approximates a real number x.
One way of measuring the error in an approximation is related to the
absolute error. If rounding to nearest is applied and fl(x) = d_1.d_2···d_t × β^e,
then the absolute error |x − fl(x)| is no larger than (1/2) β^{−t+1} β^e, which is
half of the unit in the last place of fl(x). Thus we define one ulp (unit in the
last place) of a floating-point number d_1.d_2···d_t × β^e as β^{−t+1} β^e.
Another way of measuring the error in an approximation is related to the
relative error. Again, if rounding to nearest is applied, then the relative error
satisfies

    |x − fl(x)| / |fl(x)|  ≤  (1/2) β^{−t+1} β^e / |fl(x)|  ≤  (1/2) β^{−t+1},

since |fl(x)| ≥ β^e.
From this small system, we can see that the floating-point numbers in each
interval [β^e, β^{e+1}), e_min ≤ e ≤ e_max, are equally spaced, with distance
β^{−t+1} β^e = 1 ulp.
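In Python (3.9 or later), math.ulp gives this spacing directly; a small illustration of the constant spacing within one such interval in IEEE double precision:

    import math

    # For IEEE double (beta = 2, t = 53), one ulp of x = d_1.d_2...d_t x 2^e
    # is 2^(-52) * 2^e, constant throughout the interval [2^e, 2^(e+1)).
    print(math.ulp(1.0))      # 2^-52: spacing just above 1.0
    print(math.ulp(1.999))    # same spacing anywhere in [1, 2)
    print(math.ulp(2.0))      # 2^-51: the spacing doubles at 2.0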
[Bit layout of an IEEE single-precision number: sign (bit 0), exponent (bits 1-8), fraction (bits 9-31).]
Precision
Double precision is provided to support single precision, so that intermediate
results in evaluating an expression can be computed in higher precision. Why?
Extra-precise arithmetic attenuates the risk of error due to roundoff. For
example, in evaluating an inner product of two single-precision vectors, the
intermediate results can be computed in double precision to achieve higher
accuracy.
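A small illustration (the vectors and their length are made up for the example): a single-precision inner product is accumulated once in single and once in double precision, and both are compared with a double-precision reference.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000).astype(np.float32)
    y = rng.standard_normal(10_000).astype(np.float32)

    single_acc = np.float32(0.0)   # every addition rounded to single precision
    double_acc = np.float64(0.0)   # intermediate results kept in double precision
    for xi, yi in zip(x, y):
        single_acc += xi * yi
        double_acc += np.float64(xi) * np.float64(yi)

    # the double-precision accumulation is typically far more accurate
    reference = np.dot(x.astype(np.float64), y.astype(np.float64))
    print(abs(single_acc - reference), abs(double_acc - reference))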
Exponent
Since the exponent can be positive or negative, some method must be chosen
to represent its sign. Two common methods of representing signed integers
are sign/magnitude and two’s complement. The IEEE binary standard does
not use either of these methods to represent the exponent, but instead uses
a biased representation for efficient calculation of exponents. In the case
of single precision, where the exponent is stored in 8 bits, the bias is 127
(for double precision it is 1023). What this means is that if k is the value
of the exponent bits interpreted as an unsigned integer (biased), then the
exponent of the floating-point number is k − 127 (unbiased).
                              Single    Double
    Exponent width in bits         8        11
    Format width in bits          32        64
    Bias                         127      1023
For example, in single precision, if the binary integer stored in the exponent
field is 00000001, then the exponent of the floating-point number is
1 − 127 = −126 = e_min. Similarly, the exponent e_max = 127 is stored in the
exponent field as 11111110. What do 00000000 and 11111111 represent? This
will be discussed later.
Example. Analogously, in our small system we use bias 7. For example, the
binary format 1 0001 100 represents the binary floating-point number
−1.100 × 2^{−6}.
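A hypothetical decoder for this small format (the function name and structure are ours, not from the text) makes the bias-7 encoding concrete:

    # Small 8-bit format: 1 sign bit, 4 exponent bits with bias 7, 3 fraction
    # bits, and a hidden leading 1.  The special exponent patterns 0000 and 1111
    # (zeros/denormals and infinities/NaNs) are ignored in this sketch.
    def decode_small(bits):
        sign = -1.0 if bits[0] == "1" else 1.0
        e = int(bits[1:5], 2) - 7            # remove the bias
        fraction = int(bits[5:8], 2) / 8.0   # three fraction bits after the point
        return sign * (1.0 + fraction) * 2.0 ** e

    print(decode_small("10001100"))          # -1.100_2 x 2^-6 = -0.0234375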
Infinities
There are two infinities: ±∞. There are two cases where an infinity is
produced. When a normal nonzero floating-point number is divided by zero,
an infinity is produced. Another case is when overflow occurs. Infinities
provide a way to continue the computation in these two cases. The infinity
symbol obeys the usual mathematical conventions regarding infinity, such
as ∞ + ∞ = ∞, (−1) ∗ ∞ = −∞, and (finite)/∞ = 0.
where m = max(|x|, |y|) = x. Then we will get the computed sin θ = 1.000 ×
2^{−5} or cos θ = 1.000 × 2^0.
The infinity symbol is represented by a zero fraction and the same ex-
ponent field as a NaN (emax + 1); the sign bit distinguishes between ±∞.
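These conventions can be observed directly in any IEEE-conforming environment; for instance, in Python:

    import math

    inf = math.inf
    print(inf + inf)       # inf
    print((-1) * inf)      # -inf
    print(1.0 / inf)       # 0.0, since (finite)/infinity = 0
    print(1e308 * 10)      # inf: overflow also produces an infinity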
NaNs

Traditionally, the computation of 0/0 or √−1 has been treated as an
unrecoverable error which causes a computation to halt. However, it makes
sense for a computation to continue in such a situation. The IEEE standard
introduces NaN (Not a Number). Just as with the infinities, NaNs provide a
way to continue the computation in situations like 0/0 and √−1. Suppose that
the subroutine
zero(f,a,b) finds the zeros of a function f in the interval [a, b]. Normally, we
would not require the user to input the domain of f. If, in finding a zero, a
guess falls outside the domain of f and the function is evaluated at the guess,
then it makes sense to continue the computation. For example, if
f = (x^2 − 1)/(x − 1) and the function is evaluated at the guess x = 1, the
result is NaN; thus x = 1 is rejected and the subroutine continues to search for
a zero.
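A small sketch of this scenario (numpy is used so that 0/0 yields a NaN rather than raising a Python exception; the zero finder itself is not shown):

    import math
    import numpy as np

    def f(x):
        # f(x) = (x^2 - 1)/(x - 1); at x = 1 this is 0/0, which becomes NaN.
        with np.errstate(invalid="ignore"):
            return np.float64(x * x - 1.0) / np.float64(x - 1.0)

    guess = 1.0
    value = f(guess)
    print(value)               # nan
    print(math.isnan(value))   # True: a zero finder can reject this guess and continue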
Signed zeros
Zero is represented by the exponent e_min − 1 and a zero fraction. Since the
sign bit can take on two different values, there are two zeros: ±0. Naturally,
when comparing +0 and −0, we have +0 = −0.
Signed zeros can be useful. When x > 0 underflows to +0, 1/x gives +∞;
when x < 0 underflows to −0, 1/x gives −∞. They work like lim_{x→0+} 1/x = +∞
and lim_{x→0−} 1/x = −∞. Also, with signed zeros, we have x = 1/(1/x) for
x = ±∞; without signed zeros, this would cause confusion. Another example
involves log x. With signed zeros, we preserve the mathematical identities
lim_{x→0+} log x = −∞ and lim_{x→0−} log x = NaN.
The above infinities, NaNs, and signed zeros can be similarly introduced into
our small floating-point number system. For example, 1 1111 000 represents
−∞ and 1 1111 010 represents NaN.
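The behavior of signed zeros can likewise be checked in Python (numpy is used for the divisions so that they return infinities instead of raising):

    import numpy as np

    pzero, nzero = np.float64(0.0), np.float64(-0.0)
    print(pzero == nzero)                  # True: +0 and -0 compare equal
    with np.errstate(divide="ignore"):
        print(np.float64(1.0) / pzero)     # inf
        print(np.float64(1.0) / nzero)     # -inf
    print(np.copysign(1.0, nzero))         # -1.0: the sign bit is preserved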
Denormalized numbers
In our small system, 1.000 × 2^{e_min} is stored in the bit format:

    0 0001 000

The distance between this number and the next larger number 1.001 × 2^{e_min}
is 2^{−t+1} × 2^{e_min}. However, its distance to zero is 2^{e_min}. It makes sense
to fill in this gap with the numbers 0.001 × 2^{e_min}, ..., 0.111 × 2^{e_min}. These
numbers are called denormalized numbers. Figure 1.3.1 plots the positive
denormalized numbers in our small floating-point number system.
As shown above, without denormals, the spacing between two floating-point
numbers around β^{e_min} changes from β^{−t+1} β^{e_min} to β^{e_min}. With
denormals, adjacent spacings are either the same length or differ by a factor of β.
Thus, small numbers gradually underflow to zero.
Example 1.3.2. In our small floating-point system, if the normalized numbers
x = 1.001 × 2^{−6} and y = 1.000 × 2^{−6}, then x − y is too small to be
represented in the normalized range and must be flushed to zero (underflow),
although x ≠ y. When denormalized numbers are introduced, x − y does not
underflow; instead x − y = 0.001 × 2^{−6}. The use of denormalized numbers
guarantees

    x − y = 0  ⟺  x = y.
In the IEEE standard, the denormals are encoded by e_min − 1 in the exponent
field and a nonzero fraction. In this case, there is no hidden bit. For example,
using our small system, the bit format

    0 0000 001

represents the denormalized number 0.001 × 2^{e_min}.
Denormals are less accurate due to the leading zeros. Assume our small
system. The computed result of 1.011 × 2^{−4} ⊗ 1.010 × 2^{−4} is 0.011 × 2^{−6},
a denormal. The exact product is 1.10111 × 2^{−8}, and the relative error

    |1.10111 × 2^{−8} − 0.011 × 2^{−6}| / (1.10111 × 2^{−8}) > 2^{−3},

which is larger than the roundoff unit u = 2^{−4}.
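Gradual underflow can be seen in IEEE double precision as well; a minimal sketch:

    import numpy as np

    tiny = np.finfo(np.float64).tiny     # smallest positive normal number, 2^-1022
    x = 1.25 * tiny
    y = 1.00 * tiny
    print(x - y)                         # 0.25 * 2^-1022, a denormal rather than zero
    print(x - y == 0.0, x == y)          # False False: x - y = 0 still implies x = y
    print(np.nextafter(0.0, 1.0))        # smallest positive denormal, 2^-1074 (about 5e-324)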
Table 1.1 summarizes IEEE values.
Chapter 2

Floating-point Arithmetic
The following shows two algorithms for computing the 2-norm of a vector
x = [x_i], i = 1, ..., n.
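As a sketch of two such algorithms (these exact formulations are an assumption, following the standard naive and scaled approaches rather than a verbatim copy):

    import math

    def norm2_naive(x):
        # sqrt(sum of squares): the squares can overflow or underflow even
        # when the norm itself is representable.
        return math.sqrt(sum(xi * xi for xi in x))

    def norm2_scaled(x):
        # scale by m = max |x_i| first, so every square lies in [0, 1]
        m = max(abs(xi) for xi in x)
        if m == 0.0:
            return 0.0
        return m * math.sqrt(sum((xi / m) ** 2 for xi in x))

    x = [3e200, 4e200]
    print(norm2_naive(x))    # inf: the squares overflow
    print(norm2_scaled(x))   # about 5e+200, the correct 2-norm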
• conversions of formats
The above procedure is iterated until the correction is sufficiently small.
Iterative refinement improves the solution only when the residual is calculated
in high precision.
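A minimal sketch of the usual scheme (the matrix, the function name, and the single/double precision split are illustrative assumptions): the system is solved in working (here single) precision, the residual r = b − Ax̂ is formed in higher (double) precision, and a correction is computed and added.

    import numpy as np

    def refine(A, b, x, steps=3):
        A32 = A.astype(np.float32)
        for _ in range(steps):
            r = b - A @ x                                    # residual in double precision
            d = np.linalg.solve(A32, r.astype(np.float32))   # correction in single precision
            x = x + d.astype(np.float64)                     # update the solution
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x0 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
    print(np.linalg.norm(A @ x0 - b))                   # residual of the single-precision solution
    print(np.linalg.norm(A @ refine(A, b, x0) - b))     # smaller after refinement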
How can you tell whether the system, including the processor and the
compiler, on your machine supports extra precision? Using our small number
system as an example to illustrate the idea, you can write the expression:
    Round          → +∞           → −∞           → 0            → nearest
    1.1001 × 2^2   1.101 × 2^2    1.100 × 2^2    1.100 × 2^2    1.100 × 2^2
Chapter 3

Error Analysis
If we use

    1 + x + x^2/2! + x^3/3! + ··· + x^n/n!

to approximate

    e^x = 1 + x + x^2/2! + x^3/3! + ··· + x^n/n! + ···,

then the truncation error is

    x^{n+1}/(n+1)! + x^{n+2}/(n+2)! + ···.
From the Taylor expansion

    f(x + h) = f(x) + h f'(x) + (h^2/2!) f''(ξ),   for some ξ ∈ [x, x + h],

we can use the following approximation:

    y_h(x) = (f(x + h) − f(x)) / h ≈ f'(x).

The discretization error is E_dis = (h/2) |f''(ξ)|.
Table 3.1: the computed y_h(1) and its error y_h(1) − e.

    h           y_h(1)               error
    10^{-1}     2.85884195487388      1.40560126415e-1
    10^{-2}     2.73191865578708      1.36368273280e-2
    10^{-3}     2.71964142253278      1.35959407373e-3
    10^{-4}     2.71841774707848      1.35918619434e-4
    10^{-5}     2.71829541991231      1.35914532646e-5
    10^{-6}     2.71828318698653      1.35852748429e-6
    10^{-7}     2.71828196396484      1.35505794585e-7
    10^{-8}     2.71828177744737     -5.10116753283e-8
    10^{-9}     2.71828159981169     -2.28647355716e-7
    10^{-10}    2.71827893527643     -2.89318261570e-6
    10^{-11}    2.71827005349223     -1.17749668154e-5
    10^{-12}    2.71827005349223     -1.17749668154e-5
    10^{-13}    2.71338507218388     -4.89675627517e-3
    10^{-14}    2.66453525910038     -5.37465693587e-2
    10^{-15}    2.66453525910038     -5.37465693587e-2
Note that both the truncation error and the discretization error have nothing
to do with computation. If the arithmetic is perfect (no rounding errors), the
discretization error decreases as h decreases. In practice, however, rounding
errors are unavoidable. Consider the above example and let f(x) = e^x. We
computed y_h(1) on a SUN Sparc V in MATLAB 5.20.
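The experiment is easy to repeat; a short Python version follows (the exact digits will differ slightly from Table 3.1 across platforms, but the error decreases and then increases in the same way):

    import math

    f = math.exp
    x = 1.0
    for k in range(1, 16):
        h = 10.0 ** (-k)
        yh = (f(x + h) - f(x)) / h          # forward-difference approximation of f'(1) = e
        print(f"h = 1e-{k:02d}   yh(1) = {yh:.14f}   error = {yh - math.e: .3e}")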
Table 3.1 shows that as h decreases the error first decreases and then
increases. This is because of the combination of discretization error and
rounding errors. In this example, the discretization error is
    E_dis = (h/2) |f''(ξ)| ≤ (h/2) e^{1+h} ≈ (h/2) e    for small h.
Now we consider the rounding errors. Let the computed y_h(x) be

    ŷ_h(x) = [(e^{(x+h)(1+δ_0)} (1 + δ_1) − e^x (1 + δ_2)) (1 + δ_3) / h] (1 + δ_4)
            ≈ [e^{x+h} (1 + δ_0 + δ_1 + δ_3 + δ_4) − e^x (1 + δ_2 + δ_3 + δ_4)] / h,

for |δ_i| ≤ u (i = 0, 1, 2, 3, 4). In the above derivation, we assume that the δ_i
are small, so that terms like δ_i δ_j and higher order can be ignored. We also
assume that e^{(x+h)(1+δ_0)} ≈ e^{x+h}(1 + δ_0); since x = 1 here and h is small,
(x + h)δ_0 is of essentially the same size as δ_0.
[Figure: the total error in the computed y_h(1), on a vertical scale of 10^{−5}, plotted against h for h between 10^{−10} and 10^{−5}.]
    E_round = |y_h(x) − ŷ_h(x)| ≈ |ξ_1 e^{x+h} − ξ_2 e^x| / h,   for |ξ_1| ≤ 4u and |ξ_2| ≤ 3u.

When x = 1, we have

    E_round ≈ (7u/h) e.
So the rounding error increases as h decreases. Combining both errors, we get
the total error

    E_total = E_dis + E_round ≈ (h/2 + 7u/h) e.

The total error is minimized when h/2 = 7u/h, that is, when h = √(14u),
which is about 4 × 10^{−8} in double precision; this is consistent with the
behavior shown in Table 3.1.
Suppose that ŷ is a computed approximation to the exact result y = f(x);
then the absolute error |y − ŷ| and the relative error |y − ŷ|/|y| are called
forward errors. Alternatively, we can ask: “For what set of data have we
solved our problem?” That is, the computed result ŷ is the exact result for the
input x + ∆x, i.e., ŷ = f(x + ∆x). In general, there may be many such ∆x, so
we are interested in the smallest such ∆x and a bound for |∆x|. This bound,
possibly divided by |x|, is called the backward error.
For example, consider the problem of computing the square root y = √x. The
IEEE standard requires that the computed square root satisfy

    ŷ = fl(√x) = √x (1 + δ),   |δ| ≤ u.

Then the relative error, or the forward error, is |δ|, which is bounded by u.
What is the backward error? Set

    ŷ = √x (1 + δ) = √(x + ∆x),

so that ∆x = x(2δ + δ^2) ≈ 2δx, and the backward error |∆x|/|x| ≈ 2|δ| ≤ 2u.
As another example, consider computing the sum

    s_n = x_1 ⊕ x_2 ⊕ ··· ⊕ x_n

by recursive summation. Then

    s_2 = x_1 ⊕ x_2 = x_1 (1 + ε_1) + x_2 (1 + ε_1)
    s_3 = s_2 ⊕ x_3 = (s_2 + x_3)(1 + ε_2)
        = x_1 (1 + ε_1)(1 + ε_2) + x_2 (1 + ε_1)(1 + ε_2) + x_3 (1 + ε_2)
    ...
    s_n = x_1 (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
        + x_2 (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
        + x_3 (1 + ε_2) ··· (1 + ε_{n−1})
        + ···
        + x_{n−1} (1 + ε_{n−2})(1 + ε_{n−1})
        + x_n (1 + ε_{n−1})
Define

    1 + η_1 = (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
    1 + η_2 = (1 + ε_1)(1 + ε_2) ··· (1 + ε_{n−1})
    1 + η_3 = (1 + ε_2) ··· (1 + ε_{n−1})
    ···
    1 + η_{n−1} = (1 + ε_{n−2})(1 + ε_{n−1})
    1 + η_n = 1 + ε_{n−1}

Then

    |η_n| = |ε_{n−1}| ≤ u,
    1 + η_{n−1} = 1 + (ε_{n−2} + ε_{n−1}) + ε_{n−2} ε_{n−1},   so   |η_{n−1}| ≈ |ε_{n−2} + ε_{n−1}| ≤ 2u.

In general,

    |η_1| ≤ (n − 1)u,
    |η_i| ≤ (n − i + 1)u,   i = 2, 3, ..., n.
Accounting for the higher-order terms rigorously, one can show that

    (1 + ε_1)(1 + ε_2) ··· (1 + ε_n) = 1 + η,   where |η| ≤ 1.06 n u,

provided n u is sufficiently small. Thus

    x_1 ⊕ ··· ⊕ x_n = x_1 (1 + η_1) + ··· + x_n (1 + η_n),
    |η_1| ≤ 1.06 (n − 1) u,
    |η_i| ≤ 1.06 (n − i + 1) u,   i = 2, 3, ..., n.
This gives a much smaller backward error. A formal proof, much longer and
trickier than the algorithm, can be found in [2, Page 615] or [1]. Here is an
intuitive explanation [1]. In the step t = s + y of updating the partial sum s,
we have

      s
    + y_h  y_l
    ---------
      t

where y is partitioned into its high-order bits y_h and low-order bits y_l
according to s. Next, t − s gives an approximation of y_h. Thus, c = (t − s) − y
is an approximation of −y_l, which is used as a correction incorporated into
the next summand x_i.
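This is the compensated summation scheme usually attributed to Kahan; a short Python sketch of it, compared against plain recursive summation (the test data are made up):

    import math

    def kahan_sum(xs):
        s = 0.0
        c = 0.0                 # running compensation: approximately -y_l
        for x in xs:
            y = x - c           # fold the previous correction into the next summand
            t = s + y           # the low-order bits of y are lost in this addition
            c = (t - s) - y     # (t - s) recovers y_h, so c approximates -y_l
            s = t
        return s

    xs = [0.1] * 10_000_000
    print(sum(xs))              # plain recursive summation drifts noticeably
    print(kahan_sum(xs))        # compensated summation stays much closer to the true sum
    print(math.fsum(xs))        # Python's correctly rounded reference sum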
The process of bounding the backward error is called backward error
analysis. The motivation is to interpret rounding errors as perturbations in
the data. Consequently, it reduces the question of estimating the forward
error to perturbation theory. We will see its significance in the following
sections.
To illustrate the forward and backward errors, let us consider the computation
of x̂ − ŷ, where x̂ and ŷ can be previously computed results. Assume that x
and y are exact results and x̂ = x(1 + δ_x) and ŷ = y(1 + δ_y). Then

    x̂ ⊖ ŷ = (x̂ − ŷ)(1 + δ),   |δ| ≤ u.

It then follows that x̂ ⊖ ŷ = x(1 + δ_x)(1 + δ) − y(1 + δ_y)(1 + δ). Ignoring the
second-order terms δ_x δ and δ_y δ and letting δ_1 = δ_x + δ and δ_2 = δ_y + δ, we
get

    x̂ ⊖ ŷ = x(1 + δ_1) − y(1 + δ_2).
If |δ_x| and |δ_y| are small, then |δ_1|, |δ_2| ≤ u + max(|δ_x|, |δ_y|) are also small;
i.e., the backward errors are small. However, the forward error (relative error)
is

    E_rel = |(x̂ ⊖ ŷ) − (x − y)| / |x − y| = |x δ_1 − y δ_2| / |x − y|.

If δ_1 ≠ δ_2, i.e., δ_x ≠ δ_y, it is possible that E_rel is large when |x − y| is small,
i.e., when x and y are close to each other. This is called catastrophic
cancellation. If δ_x = δ_y, and in particular if both x and y are original data
(δ_x = δ_y = 0), then δ_1 = δ_2 and E_rel = |δ_1|, which is small; the cancellation
is benign.
Example 3.2.2. Consider two algorithms for computing x^2 − y^2: the one on
the left computes x ⊗ x ⊖ y ⊗ y, the one on the right computes
(x ⊕ y) ⊗ (x ⊖ y). A backward error analysis gives

    x ⊗ x ⊖ y ⊗ y                                       (x ⊕ y) ⊗ (x ⊖ y)
    = (x^2 (1 + δ_1) − y^2 (1 + δ_2))(1 + δ_3)          = (x + y)(1 + δ_1) (x − y)(1 + δ_2)(1 + δ_3)
    ≈ (x(1 + (δ_1+δ_3)/2))^2 − (y(1 + (δ_2+δ_3)/2))^2   = (x^2 − y^2)(1 + δ_1 + δ_2 + δ_3)
    = (x(1 + δ_x))^2 − (y(1 + δ_y))^2                   = (x(1 + δ_x))^2 − (y(1 + δ_y))^2
    |δ_i| ≤ u,  |δ_x|, |δ_y| ≤ u                        |δ_i| ≤ u,  |δ_x| = |δ_y| ≤ 3u/2
The above backward analysis shows that both algorithms have small backward
errors. However, the following forward error analysis shows that the algorithm
on the left can produce a large error when |x| and |y| are close, due to
catastrophic cancellation; the algorithm on the right guarantees an accurate
result by avoiding catastrophic cancellation.
    x ⊗ x ⊖ y ⊗ y                                       (x ⊕ y) ⊗ (x ⊖ y)
    = (x^2 − y^2)(1 + δ_3) + x^2 δ_1 − y^2 δ_2          = (x^2 − y^2)(1 + δ_1 + δ_2 + δ_3)
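A single-precision illustration (the particular values of x and y are made up): both formulas are evaluated in float32 and compared with a double-precision reference.

    import numpy as np

    x = np.float32(1.0002)
    y = np.float32(1.0001)
    left = x * x - y * y           # x*x and y*y are rounded before the cancelling subtraction
    right = (x + y) * (x - y)      # x - y is formed from the original data, so the
                                   # cancellation does not amplify earlier rounding errors
    exact = float(x) ** 2 - float(y) ** 2     # double-precision reference
    print(abs(left - exact) / abs(exact))     # large relative error: catastrophic cancellation
    print(abs(right - exact) / abs(exact))    # small relative error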
    (A + ∆A) x̂ = b.

The smallest ∆A is

    ∆A = [ 0     0
           0  −3.2 ],

which is of the same size as A. This means that Gaussian elimination (without
pivoting) is unstable. Note that the exact solution is

    x = [ −3.2···
          1.0032··· ].
The magnification factor 2|x^2 + y^2| / |x^2 − y^2| is called the condition number
of the problem. When |x| is close to |y|, the condition number is large; that is,
the result x^2 − y^2 can be very sensitive to perturbations in x and y.
Note that the condition number gives an upper bound on the error
magnification for any method. Example 3.2.2 shows that the algorithm on the
left can achieve this upper bound, while the well-designed algorithm on the
right produces an accurate result even when the condition number is large,
that is, when |x| is close to |y|.
We now revisit the problem of solving a system of linear equations

    Ax = b,

where A and b are the known data and x is the result. The question is: how
sensitive is x to changes in A and/or b? We can assume that the change is
only in b, since a change in A can be transformed into a change in b. Let x̆ be
the solution of the perturbed system

    A x̆ = b + ∆b.
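A small numerical sketch of this question (the matrix and perturbation are made-up illustrations): the relative change in the solution is compared with the condition number of A times the relative change in b.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])      # a mildly ill-conditioned matrix
    b = np.array([2.0, 2.0001])
    db = np.array([0.0, 1e-6])         # a small perturbation of b

    x = np.linalg.solve(A, b)
    x_pert = np.linalg.solve(A, b + db)

    rel_change_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
    rel_change_b = np.linalg.norm(db) / np.linalg.norm(b)
    print(rel_change_x)                        # much larger than rel_change_b
    print(np.linalg.cond(A) * rel_change_b)    # the bound kappa(A) * ||db|| / ||b||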
Example 3.4.1. Consider a metal bar with one end fixed to a wall and the
other end carrying a mass. To find out how the bar bends, we need to solve a
symmetric pentadiagonal linear system whose coefficient matrix, for N = 8,
has the form

     9  −4   1
    −4   6  −4   1
     1  −4   6  −4   1
         1  −4   6  −4   1
             1  −4   6  −4   1
                 1  −4   6  −4   1
                     1  −4   5  −2
                         1  −2   1
The condition number of the matrix grows on the order of N^4, as shown in
the following table.
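A sketch that builds this matrix and computes its condition number with numpy; extending the same five-diagonal pattern and boundary rows to other values of N is our assumption here, used only to observe the growth.

    import numpy as np

    def beam_matrix(n):
        # pentadiagonal stencil [1, -4, 6, -4, 1] with the boundary rows
        # adjusted as in the N = 8 matrix shown above
        A = np.zeros((n, n))
        for i in range(n):
            for j, v in zip(range(i - 2, i + 3), (1.0, -4.0, 6.0, -4.0, 1.0)):
                if 0 <= j < n:
                    A[i, j] = v
        A[0, 0] = 9.0
        A[-2, -2], A[-2, -1] = 5.0, -2.0
        A[-1, -2], A[-1, -1] = -2.0, 1.0
        return A

    for n in (8, 16, 32, 64):
        print(n, np.linalg.cond(beam_matrix(n)))   # the text states the growth is of order N^4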
Note that the small change in the coefficient −210 has caused ten of the zeros
to become complex, and two of them have moved more than 2.81 units off the
real axis. This means the zeros of p(x) are very sensitive to changes in the
coefficients. The results were computed using very accurate (high-precision)
arithmetic, so rounding errors have no noticeable effect, nor is the algorithm
used to compute the zeros at fault; the difficulty is the sensitivity of the
problem itself.
[1] David Goldberg. What every computer scientist should know about
floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.