Numerical Analysis for STEM Students


1 Introduction
1.1 Motivation
Before studying the course, let us try to motivate ourselves by asking the following
five Wh-questions.

1. What is Numerical Analysis?

2. Where is Numerical Analysis used?

3. When is Numerical Analysis required?

4. Who requires Numerical Analysis?

5. Why should we learn Numerical Analysis?

And, finally, we ask the following:

How to do Numerical Analysis?

1. What is Numerical Analysis?


Numerical Analysis is a branch of mathematics that deals with

(a) Developing approximation procedures (called Numerical Methods) to solve mathematical problems.


(b) Developing efficient algorithms to implement the above approximation procedures as computer codes (called Implementation).
(c) Analyzing errors for the above approximation procedures (a bit of error analysis, convergence, stability).

Numerical Analysis is not about playing with numerical values in a computer. It is
about mathematical ideas and insights.
What are Numerical methods?
They are algorithms to compute:

(i) approximations of functions,


(ii) approximations of derivatives of functions at some point,
(iii) approximations of integrals of functions in some interval,
(iv) approximations of solutions of single or systems of linear/nonlinear algebraic/differential equations.

2. Where is Numerical Analysis used?


(A picture describing the whole situation will be put here.)

3. When is Numerical Analysis required?


In the following situations one can go for numerical approximations:

(a) The mathematical problem is known to have a solution, but there is no method to obtain an exact analytical expression.
(b) The mathematical problem is very difficult to handle exactly.
(c) There is a lack of data about the inputs of the mathematical problem, and it is therefore not possible to use the available exact methods.

4. Who requires Numerical Analysis?


Everyone working in science and engineering. For example:

• Mathematicians,
• Statisticians,
• Physicists,
• Chemists,
• Engineers, etc.

5. Why should we learn Numerical Analysis?


There are at least three reasons for which we need to learn and understand numerical
methods.

(a) Learning different numerical methods and their analysis makes a person more familiar with the techniques of developing new numerical methods. This is important when the available methods are not sufficient or not efficient for the specific problem to be solved.
(b) In many situations, one has several methods for a given problem. Hence choosing an appropriate method is important for producing an accurate result in less time.
(c) With a good enough knowledge of numerical methods, one can use them properly and efficiently to obtain solutions of problems, and understand what is going wrong if the results obtained are not as expected.

Through the rest of the course we will answer 'How to do numerical analysis?'

1.2 Syllabus we are going to cover


(Already mentioned in class. This portion will be filled in.)

1.3 Prerequisite
(Already mentioned in class. This portion will be filled in.)
2 Floating Point Approximation of Real Numbers and Error Analysis

We have known that Numerical Analysis is all about

• Developing approximation procedures, called numerical methods, to approximate a solution of a given mathematical problem (whenever a solution exists),

• Developing algorithms and implementing them as computer codes,

• Analyzing error in numerical approximation.

Therefore, the approximate solution, obtained by numerical method, will involve an error
which is precisely the difference between the exact solution and the approximate solution.
Thus, we have

Exact Solution = Approximate Solution + Error.

This error is called the Mathematical error.


In the next step, while developing the algorithm and implementing it as computer code,
we use a set of numerical values to evaluate the approximate solution obtained by the
numerical method. After evaluation we obtain a set of numerical values, which is called
the numerical solution to the given mathematical problem. Due to memory restrictions,
a computing device can store only a finite number of digits. Therefore, a real number
cannot be stored exactly. A certain approximation needs to be made, and only an
approximate value of the given number is stored in the device. For further calculations,
this approximate value is used instead of the exact value of the number. Hence, during the
process of computation, the computer introduces a new error, called the arithmetic error.
Then, we note that

Approximate Solution = Numerical Solution + Arithmetic Error.

Therefore, we have

Exact Solution = Numerical Solution + Mathematical Error + Arithmetic Error.

The Total Error is defined as

Total Error = Mathematical Error + Arithmetic Error.
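The decomposition above can be seen numerically in a small sketch (the example is mine, not from the notes): approximating the derivative of sin by a forward difference, the mathematical (truncation) error shrinks with h, while the arithmetic error from floating-point subtraction grows as h shrinks, so the total error is smallest at an intermediate h.

```python
import math

# Approximate f'(x) for f = sin by the forward difference (f(x+h) - f(x)) / h.
# The mathematical error of this numerical method is O(h); the arithmetic
# error comes from evaluating it in floating point and dominates for tiny h.
def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # exact derivative of sin at x
for h in (1e-1, 1e-8, 1e-15):
    approx = forward_difference(math.sin, x, h)
    print(f"h = {h:g}: total error = {abs(exact - approx):.3e}")
```

Running this shows the total error falling from roughly 10^-2 at h = 10^-1 to near its minimum around h = 10^-8, then rising again at h = 10^-15, where the arithmetic error takes over.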



The error (arithmetic error) involved in the numerical solution when compared to the
exact solution can be worse than the mathematical error.
In this chapter, we introduce the floating-point representation of a real number and see
two methods to obtain the floating-point approximation of a given real number. We introduce
different types of errors that we come across in numerical analysis and their effects on the
computation. At the end of this chapter, we will be familiar with arithmetic errors,
their effect on computed results, and some ways to minimize these errors in the computation.

2.1 Floating-Point Representation


Let β ∈ N, with β ⩾ 2. Let x ∈ R be a real number. The real number x can be represented
in base β as

x := (−1)^s × (.d1 d2 . . . dn dn+1 . . . )β × β^e, (2.1.1)

where di ∈ {0, 1, . . . , β − 1} for all i ∈ N, with either d1 ≠ 0 or di = 0 for all i ∈ N, s ∈ {0, 1},
and e ∈ Z. In the above,
• β is called the base or radix,

• s is called the sign,

• (.d1 d2 . . . dn dn+1 . . . )β is called the mantissa,

• e is called the exponent,


of the given number x. The representation (2.1.1) of a real number is called the
floating-point representation. The mantissa part can also be written as

(.d1 d2 . . . dn dn+1 . . . )β = ∑_{i=1}^{∞} di/β^i. (2.1.2)

We note that the right hand side of (2.1.2) is a convergent series in R and hence is well
defined.
Remark 2.1.1. We note the following.
• When β = 2, the floating-point representation (2.1.1) is called the binary
floating-point representation.

• When β = 10, the floating-point representation (2.1.1) is called the decimal


floating-point representation.
There are other representations of real numbers, for example octal with β = 8,
hexadecimal (used in ancient China) with β = 16, vigesimal (used in France) with
β = 20, etc. In this course we will consider β = 10.
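As a side illustration (mine, not from the notes), the mantissa digits d1, d2, . . . of a number in [0, 1) in any base β can be generated by repeated multiplication by β. In base 2 this shows why a number as simple as 0.1 cannot be stored with finitely many binary digits:

```python
# Generate the leading base-beta mantissa digits d1, d2, ... of a number
# x in [0, 1) by repeatedly multiplying by beta and peeling off the
# integer part.
def mantissa_digits(x, beta, count):
    digits = []
    for _ in range(count):
        x *= beta
        d = int(x)
        digits.append(d)
        x -= d
    return digits

print(mantissa_digits(0.625, 2, 8))   # 0.625 = (.101)_2: the expansion terminates
print(mantissa_digits(0.1, 2, 12))    # 0.1 = (.000110011001...)_2: it repeats forever
```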

2.1.1 Floating-Point Number System in a Computing Device


Due to memory restrictions, a computing device can store only a finite number of digits in
the mantissa. In this section, we introduce the floating-point approximation and discuss
how a given real number can be approximated.
A computing device stores a real number with only a finite number of digits in the
mantissa. Although different computing devices have different ways of representing the
numbers, here we introduce a mathematical form of this representation, which we will use
throughout this course.

Definition 2.1.1 (n-digit floating point number). Let β ∈ N, with β ⩾ 2. An n-digit
floating point number xf in base β is of the form

xf := (−1)^s × (.d1 d2 . . . dn)β × β^e, (2.1.3)

where di ∈ {0, 1, . . . , β − 1} for all i = 1, 2, . . . , n, with either d1 ≠ 0 or di = 0 for all
i = 1, 2, . . . , n, s ∈ {0, 1}, and e ∈ Z is an appropriate exponent. Here

(.d1 d2 . . . dn)β = ∑_{i=1}^{n} di/β^i. (2.1.4)

The number of digits n in the mantissa is called the precision or length of the
floating-point number.

Remark 2.1.2. We note the following.

• When β = 2, the n-digit floating-point representation (2.1.3) is called the n-digit
binary floating-point representation. This is used in computers.

• When β = 10, the n-digit floating-point representation (2.1.3) is called the n-digit
decimal floating-point representation. It is used in everyday computation.

Example 2.1.1. The following are examples of real numbers in the decimal floating point
representation.

(i). The real number xf = 2347 is represented in the decimal floating-point
representation as
xf = (−1)^0 × .2347 × 10^4.

In the above, we note that β = 10, s = 0, d1 = 2, d2 = 3, d3 = 4, d4 = 7, e = 4.

(ii). The real number yf = −0.000739 is represented in the decimal floating-point
representation as
yf = (−1)^1 × .739 × 10^−3.

In the above, we note that β = 10, s = 1, d1 = 7, d2 = 3, d3 = 9, e = −3.
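The normalization in Example 2.1.1 can be automated; the following sketch (helper name `decimal_form` is my own) extracts the sign s, the normalized mantissa string, and the exponent e for β = 10:

```python
from decimal import Decimal

# Write a nonzero real number x as (-1)^s x (.d1 d2 ...)_10 x 10^e with
# d1 != 0, matching the normalized decimal form of Example 2.1.1.
def decimal_form(x):
    t = Decimal(str(x)).as_tuple()
    e = len(t.digits) + t.exponent  # exponent that puts the point before d1
    mantissa = "." + "".join(map(str, t.digits))
    return t.sign, mantissa, e

print(decimal_form(2347))       # (0, '.2347', 4)
print(decimal_form(-0.000739))  # (1, '.739', -3)
```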



Remark 2.1.3. Any computing device has its own memory limitations in storing a real
number. In terms of the floating-point representation, these limitations lead to the
restrictions in the range of the exponent (e) and the number of digits in the mantissa (n).

2.1.2 Characterization of Floating-Point Number System in a


Computing Device
For a given computing device, there are integers L,U, n ∈ Z such that the exponent e is
limited to a range
L ⩽ e ⩽ U,

and n is the precision. Therefore, a floating point number system, denoted by F, is
characterized by four integers β, n ∈ N and L, U ∈ Z, where

• β = base or radix,

• n= precision,

• L=lower bound of the exponent range,

• U=upper bound of the exponent range.

Any floating point number x f ∈ F, can be represented as in (2.1.3).

2.1.3 Properties of F
(i). The n-digit floating-point representation of a number is unique.

(ii). A floating point number system F is finite and discrete.

(iii). The total number of elements in a given floating point number system is

2(β − 1)β^{n−1}(U − L + 1) + 1.

Indeed, while counting the numbers, there are

• 2 choices of the sign s,
• (β − 1) choices of the leading digit d1,
• β choices of each of the remaining digits d2, d3, . . . , dn,
• (U − L + 1) possible values of the exponent e.

Therefore, there are in total 2(β − 1)β^{n−1}(U − L + 1) nonzero numbers. Finally, add 1 for the number zero.
2. Floating Point Approximation of Real Numbers and Error Analysis 10

(iv). The mantissa m(xf) := (.d1 d2 . . . dn)β = ∑_{i=1}^{n} di/β^i satisfies

1/β ⩽ m(xf) < 1.

(v). The smallest positive floating point number is

UFL = β^{L−1}.

Indeed, for the smallest positive floating point number,

• the sign, s = 0,
• the leading digit, d1 = 1,
• the remaining digits, d2 = d3 = · · · = dn = 0,
• the exponent, e = L.

Hence,

UFL = (.10 . . . 0)β × β^L = (1/β) β^L = β^{L−1}.

It is also called the Under Flow Level (UFL).

(vi). The largest positive floating point number is

OFL = (1 − β^{−n})β^U.

Indeed, for the largest positive floating point number,

• the sign, s = 0,
• all the digits, d1 = d2 = · · · = dn = β − 1,
• the exponent, e = U.

Hence,

OFL = (.(β − 1)(β − 1) . . . (β − 1))β × β^U
    = ∑_{i=1}^{n} (β − 1)/β^i × β^U
    = ((β − 1)/β) β^U (1 + 1/β + · · · + 1/β^{n−1})
    = ((β − 1)/β) β^U (1 − (1/β)^n)/(1 − 1/β)
    = (1 − β^{−n})β^U.

It is also called the Over Flow Level (OFL).

(vii). Every nonzero floating point number xf ∈ F satisfies

UFL ⩽ |xf| ⩽ OFL.

(viii). Floating point numbers are not uniformly distributed throughout their range; they
are equally spaced only between successive powers of β.

(ix). If an arithmetic operation leads to a computed number x satisfying |x| > OFL, then
the calculation may terminate, and an error occurs in the computation. It is called an
overflow error.

(x). If an arithmetic operation leads to a computed number x satisfying |x| < UFL, then
the floating point representation of x (the floating point approximation of a real
number will be discussed in subsection 2.1.5) will be set to zero. This also introduces
an error in the computation. It is called an underflow error.

In subsection 2.1.4, we introduce the concept of under and over flow of memory, which is
a result of the restriction in the exponent. The restriction on the length of the mantissa is
discussed in subsection 2.1.5.
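Properties (iii), (v) and (vi) above can be checked by brute force on a small system; the sketch below enumerates a toy F with parameters I chose for illustration (β = 2, n = 3, L = −1, U = 2), using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Enumerate the toy floating point system with beta = 2, n = 3, L = -1, U = 2
# and verify the counting formula, UFL and OFL from properties (iii), (v), (vi).
beta, n, L, U = 2, 3, -1, 2

numbers = {Fraction(0)}
for s, e, digits in product((0, 1), range(L, U + 1), product(range(beta), repeat=n)):
    if digits[0] == 0:
        continue  # normalized: the leading digit d1 must be nonzero
    m = sum(Fraction(d, beta ** i) for i, d in enumerate(digits, start=1))
    numbers.add((-1) ** s * m * Fraction(beta) ** e)

positives = [x for x in numbers if x > 0]
print(len(numbers))    # 33 = 2(beta - 1)beta^(n-1)(U - L + 1) + 1
print(min(positives))  # 1/4 = beta^(L-1), the UFL
print(max(positives))  # 7/2 = (1 - beta^-n)beta^U, the OFL
```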

2.1.4 Overflow and Underflow of Memory


• When the value of the exponent e in a floating-point number exceeds the maximum
limit of the memory, we encounter the overflow of memory.

• When the value of the exponent e goes below the minimum of the range, then we
encounter underflow.

During a computation,

• if some computed number xf has an exponent e > U (equivalently, |xf| > OFL),
then we say that memory overflow occurs;

• if e < L (equivalently, |xf| < UFL), we say that memory underflow occurs.

Remark 2.1.4. We note the following.

(i). • In the case of overflow of memory in a floating-point number, a computer will
usually produce meaningless results or simply print the symbol inf or NaN.

• When a computation involves an undetermined quantity (like 0 × ∞, ∞ − ∞, 0/0),
then the output of the computed value on a computer will be the symbol NaN
(meaning 'not a number'). For instance, if xf is a sufficiently large number that
results in an overflow of memory when stored on a computing device, and yf is
another number that results in an underflow, then their product will be returned
as NaN.

(ii). On the other hand, underflow is more serious than overflow in a computation,
because when underflow occurs, a computer will simply consider the number as zero
without any warning. The error is disguised in the computation and the computer
will not inform you about it, whereas in the case of overflow the computer will warn
you to modify the algorithm (code). If a computation involves such an underflow of
memory, then there is a danger of a large difference between the actual value and
the computed value. However, by writing a separate subroutine, one can monitor
and get a warning whenever an underflow occurs.
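Both behaviors can be observed directly with Python floats, which are IEEE 754 double precision: overflow announces itself loudly as inf, while underflow silently becomes zero, exactly as remarked above.

```python
import math
import sys

# Overflow and underflow with double precision floats.
big = sys.float_info.max   # near the OFL of double precision
tiny = sys.float_info.min  # smallest normalized positive double

print(big * 2)      # inf: overflow is flagged visibly
print(tiny / 1e300) # 0.0: underflow silently flushes to zero
print(math.isnan(big * 2 - big * 2))  # True: inf - inf is the undetermined NaN
```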

2.1.5 Floating-Point Approximation of Real Numbers


In general, a real number can have infinitely many digits, which a computing device
cannot hold in its memory. Rather, each computing device will have its own limitation on
the length of the mantissa. If a given real number has infinitely many digits in the
mantissa of the floating-point form as in (2.1.1), then the computing device converts this
number into an n-digit floating-point form as in (2.1.3). Such an approximation is called
the floating-point approximation of a real number.
There are many ways to get floating-point approximation of a given real number. Here we
introduce two types of floating-point approximations.

Definition 2.1.2 (Chopping and Rounding). Let x ∈ R be a real number given in the
floating-point representation (2.1.1) as

x = (−1)^s × (.d1 d2 . . . dn dn+1 . . . )β × β^e. (2.1.5)

The floating-point approximation of x using n-digit chopping is given by

fl(x) = (−1)^s × (.d1 d2 . . . dn)β × β^e. (2.1.6)

The floating-point approximation of x using n-digit rounding is given by

fl(x) = (−1)^s × (.d1 d2 . . . dn)β × β^e,        if 0 ⩽ dn+1 < β/2,
fl(x) = (−1)^s × (.d1 d2 . . . (dn + 1))β × β^e,  if β/2 ⩽ dn+1 < β,  (2.1.7)

where

(.d1 d2 . . . (dn + 1))β = (.d1 d2 . . . dn)β + (.00 . . . 01)β   [with (n − 1) zeros before the 1]
                        = ∑_{i=1}^{n} di/β^i + 1/β^n.

Remark 2.1.5. We note the following. If for a real number x we have dn+1 ⩾ β/2 and
dn = β − 1, then in the case of n-digit rounding, dn + 1 in fl(x) would be β. Therefore, we
put dn = 0 and add 1 to dn−1. Again, if dn−1 = β − 1, we put dn−1 = 0 and add 1 to dn−2.
We continue further if required until all the digits are exhausted. It may happen that
d1 = d2 = · · · = dn = β − 1 and dn+1 ⩾ β/2; then for fl(x) we put d1 = 1, all other di = 0
for i = 2, 3, . . . , n, and increase the exponent by 1. That is, if

x = (−1)^s × (.(β − 1)(β − 1) . . . (β − 1) dn+1 . . . )β × β^e   [n digits equal to β − 1],  dn+1 ⩾ β/2,

then in the case of n-digit rounding

fl(x) = (−1)^s × (.100 . . . 0)β × β^{e+1}   [(n − 1) zeros].
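For β = 10, both approximations can be sketched with Python's decimal module: ROUND_DOWN is chopping, and ROUND_HALF_UP matches the rounding rule of (2.1.7) (round up when dn+1 ⩾ β/2), including the carry propagation of Remark 2.1.5. The helper name `fl` mirrors the notation of the text.

```python
from decimal import Decimal, localcontext, ROUND_DOWN, ROUND_HALF_UP

# n-digit decimal chopping and rounding of a real number given as a string.
def fl(x, n, mode):
    with localcontext() as ctx:
        ctx.prec = n
        ctx.rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
        return +Decimal(x)  # unary plus re-rounds to the context precision

print(fl("3.1415926", 5, "chop"))   # 3.1415
print(fl("3.1415926", 5, "round"))  # 3.1416
print(fl("0.999995", 5, "round"))   # 1.0000: the carry raises the exponent
```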

Example 2.1.2. (i). The floating-point representation of π is given by

π = (−1)^0 × (.31415926 . . . )10 × 10^1.

The floating-point approximation of π using 5-digit chopping is given by

fl(π) = (−1)^0 × (.31415)10 × 10^1.

The floating-point approximation of π using 5-digit rounding is given by

fl(π) = (−1)^0 × (.31416)10 × 10^1,

because d6 = 9 ⩾ 5, so in fl(π), d5 = 5 + 1 = 6.

Remark 2.1.6. Most modern processors, including Intel's, use the IEEE 754 standard
format. In its 64-bit binary representation, this format uses 52 bits for the mantissa,
11 bits for the exponent and 1 bit for the sign. This representation is called double
precision. When we perform a computation without any floating-point approximation,
we say that the computation is done using infinite precision (also called exact arithmetic).
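The 1 + 11 + 52 bit layout can be inspected directly by reinterpreting a Python float's bytes (a small sketch; the bit-field positions follow the IEEE 754 double precision layout):

```python
import struct

# Split a double precision float into its 1 sign bit, 11 exponent bits
# and 52 mantissa (fraction) bits.
def double_bits(x):
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # biased by 1023
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

print(double_bits(1.0))   # (0, 1023, 0): 1.0 = (-1)^0 x 1.0 x 2^(1023-1023)
print(double_bits(-2.5))  # (1, 1024, 2^50): -2.5 = -1.25 x 2^1
```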

2.1.6 Arithmetic Operations of Floating-Point Approximations


In this subsection, we describe the procedure of performing arithmetic operations using
n-digit rounding. The procedure of performing arithmetic operation using n-digit
chopping can be done in a similar way.

Procedure of performing arithmetic operations

Let ⊙ denote any one of the basic arithmetic operations +, −, ×, ÷. Let x, y ∈ R be two
real numbers. The process of computing x ⊙ y using n-digit rounding is as follows.

• Step 1: First, we consider the n-digit rounding approximations fl(x) and fl(y) of the
numbers x and y, respectively.

• Step 2: Then, we perform the calculation fl(x) ⊙ fl(y) using exact arithmetic.

• Step 3: Finally, we consider the n-digit rounding approximation fl(fl(x) ⊙ fl(y)) of
fl(x) ⊙ fl(y).

The result from Step 3 is the value of x ⊙ y using n-digit rounding.
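The three steps above can be sketched for β = 10 with Python's decimal module (helper name `op_with_rounding` is mine). Inside a context of precision n, each Decimal operation behaves as "compute exactly, then round to n digits", which is exactly Steps 2 and 3 combined.

```python
from decimal import Decimal, localcontext, ROUND_HALF_UP

# Compute x (.) y using n-digit rounding, following Steps 1-3.
def op_with_rounding(x, y, op, n):
    with localcontext() as ctx:
        ctx.prec = n
        ctx.rounding = ROUND_HALF_UP
        fx, fy = +Decimal(x), +Decimal(y)  # Step 1: fl(x) and fl(y)
        return {"+": fx + fy, "-": fx - fy,
                "*": fx * fy, "/": fx / fy}[op]  # Steps 2 and 3 combined

# fl(2.71828) = 2.718, fl(3.14159) = 3.142, and 2.718 + 3.142 = 5.860.
print(op_with_rounding("2.71828", "3.14159", "+", 4))  # 5.860
```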

Example 2.1.3. Let us consider the function f : [0, ∞) → R given by

f(x) = x(√(x + 1) − √x), x ∈ [0, ∞).

We evaluate f(100000) using 6-digit rounding arithmetic. We have

f(100000) = 100000(√100001 − √100000).

We see that

√100001 = 316.229347 · · · = .316229347 · · · × 10^3,

and

√100000 = 316.227766 · · · = .316227766 · · · × 10^3.

Using 6-digit rounding we note that

fl(√100001) = .316229 × 10^3, and fl(√100000) = .316228 × 10^3.

Then,

fl(√100001) − fl(√100000) = .000001 × 10^3 = .1 × 10^−2.

Hence,

fl(fl(√100001) − fl(√100000)) = .1 × 10^−2.

Again we see that fl(100000) = .1 × 10^6. Then,

fl(100000) × fl(fl(√100001) − fl(√100000)) = .1 × 10^6 × .1 × 10^−2 = .01 × 10^4 = .1 × 10^3.

Finally, we have

fl(f(100000)) = .1 × 10^3 = 100.

Similarly, using 6-digit chopping, the value of fl(f(100000)) is 200.
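Example 2.1.3 can be replayed with the decimal module at 6 significant digits (the exact value is about 158.113, so the computed 100 has lost almost all correct digits). Rewriting f algebraically, an equivalent form I add here for contrast, avoids the cancellation:

```python
from decimal import Decimal, getcontext

# f(x) = x(sqrt(x+1) - sqrt(x)) at x = 100000 with 6-digit rounding.
getcontext().prec = 6

x = Decimal(100000)
naive = x * ((x + 1).sqrt() - x.sqrt())   # subtraction cancels the leading digits
stable = x / ((x + 1).sqrt() + x.sqrt())  # same f, rearranged to avoid cancellation

print(naive)   # 100.000, far from the exact value ~158.113
print(stable)  # 158.114, accurate to nearly all 6 digits
```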

2.2 Types of Errors


The approximate representation of a real number obviously differs from the actual
number; this difference is called an error.

Definition 2.2.1 (Errors). (i). The error in a computed quantity is defined as

Error = True Value – Approximate Value.

(ii). Absolute value of an error is called the absolute error.

(iii). The relative error is a measure of the error in relation to the size of the true value
as given by

Relative Error = Error / True Value, provided True Value ≠ 0.

(iv). Absolute value of the relative error is called the absolute relative error.

(v). The percentage error is defined as

Percentage Error = 100 × |Relative Error| .

Remark 2.2.1. Let xA denote the approximation to the real number x ∈ R. We use the
following notations.

E(xA) := Error(xA) = x − xA. (2.2.1)
Ea(xA) := Absolute Error(xA) = |x − xA|. (2.2.2)
Er(xA) := Relative Error(xA) = (x − xA)/x, x ≠ 0. (2.2.3)
Ear(xA) := Absolute Relative Error(xA) = |(x − xA)/x|, x ≠ 0. (2.2.4)

The absolute error has to be understood more carefully, because a relatively small
difference between two large numbers can appear to be large, and a relatively large
difference between two small numbers can appear to be small. On the other hand, the
relative error gives a percentage of the difference between two numbers, which is usually
more meaningful, as illustrated below.

Example 2.2.1.
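A concrete instance (Example 2.2.1 is left blank in the notes, so the numbers below are my own): a difference of 1 between two numbers near a million is a large absolute error but a tiny relative error, while a difference of a few millionths between two tiny numbers is the opposite.

```python
# Compare absolute and relative error for a large and a small true value.
def errors(x, xA):
    return {"absolute": abs(x - xA), "relative": abs((x - xA) / x)}

print(errors(1_000_000.0, 999_999.0))  # absolute 1.0, relative ~1e-06
print(errors(0.000012, 0.000009))      # absolute ~3e-06, relative ~0.25
```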

2.3 Sources of Errors


• Mathematical modeling of a physical problem.

• Uncertainty in physical data or empirical measurements.

• Errors from previous computations.

• Blunders in arithmetic computations or in methods.

• Machine Errors.

• Mathematical Truncation Error.

2.4 Machine Errors


In this section we will be discussing Machine Errors.

2.4.1 Errors in Floating-point Approximations


For most real numbers x ∈ R, we have fl(x) ≠ x. We now measure the errors in
floating-point approximations of real numbers.

Theorem 2.4.1. Let x ∈ R, x ≠ 0 be such that UFL ⩽ |fl(x)| ⩽ OFL. Then,

Ear(fl(x)) ⩽ β^{−n+1}        in case of chopping,
Ear(fl(x)) ⩽ (1/2)β^{−n+1}   in case of rounding.

Proof. Let us assume that x has the floating-point representation

x = (−1)^s × (.d1 d2 . . . dn dn+1 . . . )β × β^e = (−1)^s β^e ∑_{i=1}^{∞} di/β^i.

First, we note that

|x| = |(−1)^s β^e ∑_{i=1}^{∞} di/β^i| = β^e ∑_{i=1}^{∞} di/β^i ⩾ β^e (d1/β) ⩾ β^{e−1}. (2.4.1)

We prove the result for the two cases separately.

Case of chopping: In the case of n-digit chopping,

fl(x) = (−1)^s × (.d1 d2 . . . dn)β × β^e = (−1)^s β^e ∑_{i=1}^{n} di/β^i.

Hence,

Ea(fl(x)) = |x − fl(x)| = |(−1)^s β^e (∑_{i=1}^{∞} di/β^i − ∑_{i=1}^{n} di/β^i)|
          = β^e ∑_{i=n+1}^{∞} di/β^i
          ⩽ β^e (β − 1) ∑_{i=n+1}^{∞} 1/β^i
          = β^e (β − 1) (1/β^{n+1})/(1 − 1/β) = β^e/β^n = β^{e−n}.

Using (2.4.1), we have

Ear(fl(x)) = |x − fl(x)|/|x| ⩽ β^{e−n}/β^{e−1} = β^{−n+1}.

Case of rounding, sub-case 0 ⩽ dn+1 < β/2: We note that

fl(x) = (−1)^s × (.d1 d2 . . . dn)β × β^e = (−1)^s β^e ∑_{i=1}^{n} di/β^i.

Hence,

Ea(fl(x)) = |x − fl(x)| = |(−1)^s β^e (∑_{i=1}^{∞} di/β^i − ∑_{i=1}^{n} di/β^i)|
          = β^e ∑_{i=n+1}^{∞} di/β^i = β^e (dn+1/β^{n+1} + ∑_{i=n+2}^{∞} di/β^i)
          ⩽ β^e ((β/2 − 1)/β^{n+1} + (β − 1) ∑_{i=n+2}^{∞} 1/β^i)
            (∵ 0 ⩽ dn+1 < β/2 implies 0 ⩽ dn+1 ⩽ β/2 − 1, and di ⩽ β − 1)
          = β^e ((β/2 − 1)/β^{n+1} + (β − 1) (1/β^{n+2})/(1 − 1/β))
          = β^e ((β − 2)/(2β^{n+1}) + 1/β^{n+1})
          = β^e (β − 2 + 2)/(2β^{n+1}) = β^e β/(2β^{n+1}) = (1/2) β^{e−n}.

Using (2.4.1), we have

Ear(fl(x)) = |x − fl(x)|/|x| ⩽ (1/2) β^{e−n}/β^{e−1} = (1/2) β^{−n+1}.

Case of rounding, sub-case β/2 ⩽ dn+1 < β: We note that

fl(x) = (−1)^s × (.d1 d2 . . . (dn + 1))β × β^e = (−1)^s β^e (∑_{i=1}^{n} di/β^i + 1/β^n).

Hence,

Ea(fl(x)) = |x − fl(x)| = |(−1)^s β^e (∑_{i=1}^{∞} di/β^i − ∑_{i=1}^{n} di/β^i − 1/β^n)|
          = β^e |∑_{i=n+1}^{∞} di/β^i − 1/β^n|
          = β^e (1/β^n − ∑_{i=n+1}^{∞} di/β^i)
            (∵ ∑_{i=n+1}^{∞} di/β^i ⩽ (β − 1) ∑_{i=n+1}^{∞} 1/β^i = (β − 1) (1/β^{n+1})/(1 − 1/β) = 1/β^n)
          ⩽ β^e (1/β^n − dn+1/β^{n+1})
          ⩽ β^e (1/β^n − (β/2)/β^{n+1})   (∵ dn+1 ⩾ β/2)
          = (1/2) β^{e−n}.

Finally, using (2.4.1), we have

Ear(fl(x)) = |x − fl(x)|/|x| ⩽ (1/2) β^{e−n}/β^{e−1} = (1/2) β^{−n+1}.

This completes the proof.
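The bounds of Theorem 2.4.1 can be checked empirically for β = 10 and n = 4 with the decimal module (a sketch; the sample range is my own choice):

```python
import random
from decimal import Decimal, localcontext, ROUND_DOWN, ROUND_HALF_UP

# Absolute relative error of the n-digit floating-point approximation of x.
def rel_error(x, n, rounding):
    with localcontext() as ctx:
        ctx.prec = n
        ctx.rounding = rounding
        fx = +Decimal(x)  # n-digit chopping or rounding of x
    return abs((Decimal(x) - fx) / Decimal(x))

random.seed(0)
samples = [str(random.uniform(0.001, 1000.0)) for _ in range(1000)]
# Theorem 2.4.1 with beta = 10, n = 4: bound 10^-3 for chopping,
# (1/2) x 10^-3 for rounding.
assert all(rel_error(x, 4, ROUND_DOWN) <= Decimal("1e-3") for x in samples)
assert all(rel_error(x, 4, ROUND_HALF_UP) <= Decimal("0.5e-3") for x in samples)
print("Theorem 2.4.1 bounds hold on all samples")
```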



2.4.2 Accuracy of Floating-point Approximations


Unit round-off

We now introduce two measures that give a fairly precise idea of the possible accuracy in
a floating-point approximation.
Definition 2.4.1 (Unit round-off (u)). The unit round-off of a computer is denoted by u
and is defined as

u = sup{ Ear(fl(x)) : x ∈ R, UFL ⩽ |x| ⩽ OFL },

i.e., it is the maximum possible value of Ear(fl(x)) for any x satisfying the condition of
Theorem 2.4.1. By Theorem 2.4.1 we note that

u = β^{−n+1}        in case of chopping,
u = (1/2)β^{−n+1}   in case of rounding.

Alternative characterization of unit round-off

Definition 2.4.2 (Machine Epsilon (εmach)). The machine epsilon of a computer is
denoted by εmach and is defined as

εmach = inf{ δ ∈ F : δ > 0 and fl(1 + δ) > 1 }.

Theorem 2.4.2.

εmach = β^{−n+1}        in case of chopping,
εmach = (1/2)β^{−n+1}   in case of rounding.

Proof. See Atkinson, An Introduction to Numerical Analysis, 2nd ed., p. 15.
Remark 2.4.1. From the definition of εmach, we note that any δ ∈ F with 0 < δ < εmach
satisfies fl(1 + δ) = 1, i.e., in the computer the number 1 + δ is identical to 1.
Remark 2.4.2. We note that εmach ≠ UFL. εmach is determined by the precision, i.e., the
number of digits in the mantissa field of the floating-point system, whereas UFL is
determined by the lower bound of the exponent field.
Example 2.4.1.
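A concrete computation for Example 2.4.1 (the example is left blank in the notes, so this instance is my own): the classical halving loop finds the spacing between 1 and the next double precision number. Since IEEE arithmetic rounds, Theorem 2.4.2 would give half this spacing as εmach; the value the loop finds is the conventional "machine epsilon" reported by Python.

```python
import sys

# Halve eps until 1 + eps/2 is no longer distinguishable from 1 in
# double precision.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16 = 2^-52
print(eps == sys.float_info.epsilon)  # True
```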

Maximal Accuracy

Definition 2.4.3 (Maximal Accuracy). The maximal accuracy of a floating-point
representation is denoted by Macc and is defined as

Macc = sup{ m ∈ Z : m ⩾ 0 and fl(k) = k for every integer k with 0 ⩽ k ⩽ m }.

Remark 2.4.3. From the definition of maximal accuracy, we note that
fl(Macc + 1) ≠ Macc + 1.

Theorem 2.4.3.
Macc = β n .

Proof. See Atkinson, An Introduction to Numerical Analysis, 2nd ed., p. 16.
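Theorem 2.4.3 is easy to observe for Python's doubles. Viewing double precision as β = 2 with a 53-bit significand (52 stored bits plus an implicit leading bit, so n = 53 in this chapter's normalization, which is an assumption about the mapping), Macc = 2^53:

```python
# Every integer up to 2^53 is stored exactly in a double; 2^53 + 1 is not.
macc = 2 ** 53

print(float(macc) == macc)             # True: fl(Macc) = Macc
print(float(macc + 1) == macc + 1)     # False: fl(Macc + 1) != Macc + 1
print(float(macc + 1) == float(macc))  # True: 2^53 + 1 rounds back to 2^53
```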

2.4.3 Loss of Significance


In place of relative error, we often use the concept of significant digits that is closely
related to relative error.

Definition 2.4.4 (Significant digits). Let us consider a floating-point number system with
base β ∈ N, β ⩾ 2. For x ∈ R, let xA be an approximation of x. Let

s = sup{ k ∈ Z : β^k ⩽ |x| }  and  r = sup{ t ∈ N : |x − xA| ⩽ (1/2) β^{s+1−t} }.

Then we say that xA has r significant digits as an approximation of x.

Example 2.4.2. We find significant digits for the following approximations.

(i). x = 1/3 = .3333 . . . and xA = .333. We note that

10^{−1} = .1 ⩽ x, hence s = −1,

and

|x − xA| = .0003333 . . . = (.3333 . . . ) × 10^{−3} = (1/2) × (.6666 . . . ) × 10^{−3}
         < (1/2) × 10^{−3} = (1/2) × 10^{−1+1−3}.

Therefore, xA has 3 significant digits as an approximation of x.

(ii). x = 23.496 and xA = 23.494. We note that

10^1 ⩽ x, hence s = 1,

and

|x − xA| = .002 = .2 × 10^{−2} = (1/2) × .4 × 10^{−2}
         < (1/2) × 10^{−2} = (1/2) × 10^{1+1−4}.

Therefore, xA has 4 significant digits as an approximation of x.

(iii). x = 0.02138 and xA = 0.02144. We note that

10^{−2} ⩽ x, hence s = −2,

and

|x − xA| = .00006 = .6 × 10^{−4} = (1/2) × 1.2 × 10^{−4} = (1/2) × .12 × 10^{−3}
         < (1/2) × 10^{−3} = (1/2) × 10^{−2+1−2}.

Therefore, xA has 2 significant digits as an approximation of x.
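Definition 2.4.4 can be computed directly for β = 10; the sketch below (helper name mine) caps the search at 15 digits, since double precision cannot certify more:

```python
import math

# Number of significant digits r of xA as an approximation of x, per
# Definition 2.4.4 with beta = 10.
def significant_digits(x, xA):
    s = math.floor(math.log10(abs(x)))  # largest integer s with 10^s <= |x|
    r = 0
    while r < 15 and abs(x - xA) <= 0.5 * 10.0 ** (s + 1 - (r + 1)):
        r += 1
    return r

print(significant_digits(1 / 3, 0.333))      # 3, as in Example 2.4.2 (i)
print(significant_digits(23.496, 23.494))    # 4, as in (ii)
print(significant_digits(0.02138, 0.02144))  # 2, as in (iii)
```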

Remark 2.4.4. (i). We note that

Ear(xA) = |x − xA|/|x| ⩽ ((1/2) β^{s−r+1})/β^s = (1/2) β^{1−r}.

Hence, the absolute relative error decreases with an increase in the number of
significant digits.

(ii). The number of significant digits roughly measures the number of leading non-zero
digits of xA that are correct relative to the corresponding digits in the true value x.
However, this is not a precise way to obtain the number of significant digits, as we
have seen in the above examples.

(iii). The role of significant digits in numerical calculations is very important in the sense
that the loss of significant digits may result in drastic amplification of the relative
error as illustrated in the following example.

Example 2.4.3 (Loss of Significance). (i). Let us consider two real numbers

x = 7.6545428 = .76545428 × 10^1,  y = 7.6544201 = .76544201 × 10^1.

Let

xA = 7.6545421 = .76545421 × 10^1,  yA = 7.6544200 = .76544200 × 10^1.
