Numerical Analysis for STEM Students
1 Introduction
1.1 Motivation
Before starting the course, let us motivate ourselves by asking the following five 'Wh' questions.
(a) The mathematical problem is known to have a solution, but no method is available to obtain an exact analytical expression for it.
(b) The mathematical problem is too difficult to handle exactly.
(c) There is a lack of sufficient data about the inputs of the mathematical problem, and therefore it is not possible to use the available exact methods.
• Mathematicians,
• Statisticians,
• Physicists,
• Chemists,
• Engineers, etc.
(a) Learning different numerical methods and their analysis makes a person more familiar with the techniques of developing new numerical methods. This is important because the available methods may not be sufficient, or not efficient enough, for a specific problem.
(b) In many situations, more than one method is available for a given problem. Hence, choosing an appropriate method is important for producing an accurate result in less time.
(c) With a good enough knowledge of numerical methods, one can use them properly and efficiently to obtain solutions of problems, and understand what is going wrong if the results obtained are not as expected.
Through the rest of the course, we will answer 'How to do numerical analysis?'
1.3 Prerequisite
(Already mentioned in class; this portion will be filled in.)
2 Floating Point Approximation of Real Numbers and Error Analysis
Therefore, the approximate solution obtained by a numerical method involves an error, which is precisely the difference between the exact solution and the approximate solution. Thus, we have
$$\text{Mathematical Error} = \text{Exact Solution} - \text{Approximate Solution}.$$
an approximate value of the given number is stored in the device. For further calculations,
this approximate value is used instead of the exact value of the number. Hence, during the
process of computation, the computer introduces a new error, called the arithmetic error.
Then, we note that
$$\text{Total Error} = \text{Exact Solution} - \text{Computed Solution} = (\text{Exact Solution} - \text{Approximate Solution}) + (\text{Approximate Solution} - \text{Computed Solution}).$$
Therefore, we have
$$\text{Total Error} = \text{Mathematical Error} + \text{Arithmetic Error}.$$
The error (arithmetic error) involved in the numerical solution when compared to the
exact solution can be worse than the mathematical error.
In this chapter, we introduce the floating-point representation of a real number and see two methods to obtain the floating-point approximation of a given real number. We introduce the different types of errors that we come across in numerical analysis and their effects on the computation. At the end of this chapter, we will be familiar with arithmetic errors, their effect on computed results, and some ways to minimize this error in the computation.
We note that the right hand side of (2.1.2) is a convergent series in $\mathbb{R}$ and hence is well defined.
Remark 2.1.1. We note the following.
• When β = 2, the floating-point representation (2.1.1) is called the binary
floating-point representation.
$$(.d_1 d_2 \ldots d_n)_\beta = \sum_{i=1}^{n} \frac{d_i}{\beta^i}. \tag{2.1.4}$$
The number of digits n in the mantissa is called the precision or length of the
floating-point number.
• When β = 10, the n-digit floating-point representation (2.1.1) is called the n-digit decimal floating-point representation. It is the one used in daily-life computations.
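The mantissa formula (2.1.4) can be sketched in Python (an added illustration; the function name is ours):

```python
def mantissa_value(digits, beta):
    """Value of the mantissa (.d1 d2 ... dn)_beta = sum_{i=1}^{n} d_i / beta**i."""
    return sum(d / beta**i for i, d in enumerate(digits, start=1))

# (.101)_2 = 1/2 + 0/4 + 1/8
print(mantissa_value([1, 0, 1], 2))   # 0.625
# (.314)_10 = 3/10 + 1/100 + 4/1000
print(mantissa_value([3, 1, 4], 10))
```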
Example 2.1.1. The following are examples of real numbers in the decimal floating point
representation.
Remark 2.1.3. Any computing device has its own memory limitations in storing a real
number. In terms of the floating-point representation, these limitations lead to the
restrictions in the range of the exponent (e) and the number of digits in the mantissa (n).
• β = base or radix,
• n = precision,
2.1.3 Properties of F
(i). The n-digit floating-point representation of a number is unique.
Therefore, there are in total $2(\beta - 1)\beta^{n-1}(U - L + 1)$ such numbers, plus 1 for the number zero.
$$\frac{1}{\beta} \le m(x_f) < 1.$$
$$\mathrm{UFL} = \beta^{L-1}.$$
• the sign, s = 0,
• the leading digit, d1 = 1,
• the remaining digits, d2 = d3 = · · · = dn = 0,
• the exponent, e = L.
Hence,
$$\mathrm{UFL} = (.10\ldots0)_\beta \times \beta^L = \frac{1}{\beta}\,\beta^L = \beta^{L-1}.$$
It is also called the Underflow Level (UFL).
$$\mathrm{OFL} = (1 - \beta^{-n})\,\beta^U.$$
• the sign, s = 0,
• all the digits, d1 = d2 = · · · = dn = (β − 1),
• the exponent, e = U.
Hence,
$$\mathrm{OFL} = (.\,(\beta-1)(\beta-1)\ldots(\beta-1))_\beta \times \beta^U = (1 - \beta^{-n})\,\beta^U.$$
$$\mathrm{UFL} \le |x_f| \le \mathrm{OFL}.$$
(viii). Floating point numbers are not uniformly distributed throughout their range, but
they are equally spaced only between successive powers of β .
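Property (viii) can be observed directly for IEEE doubles (β = 2) using Python's `math.ulp`, which returns the spacing from a number to the next representable one (our illustration, not part of the notes):

```python
import math

# The gap between consecutive doubles doubles at each power of 2:
print(math.ulp(1.0))   # 2**-52: spacing just above 1.0
print(math.ulp(2.0))   # 2**-51: twice as coarse
print(math.ulp(0.5))   # 2**-53: twice as fine
```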
In subsection 2.1.4, we introduce the concept of under and over flow of memory, which is
a result of the restriction in the exponent. The restriction on the length of the mantissa is
discussed in subsection 2.1.5.
• When the value of the exponent e goes below the minimum of its range, we encounter underflow.
During a computation,
• if some computed number x_f has an exponent e > U (equivalently, |x_f| > OFL), then we say that memory overflow occurs;
• if e < L (equivalently, 0 < |x_f| < UFL), we say that memory underflow occurs.
(ii). On the other hand, underflow is more serious than overflow in a computation, because when underflow occurs, a computer simply treats the number as zero without any warning: the error is disguised in the computation and the computer will not inform you about it, whereas in the case of overflow the computer warns you to modify the algorithm (code). If a computation involves such an underflow of memory, then there is a danger of a large difference between the actual value and the computed value. However, by writing a separate subroutine, one can monitor the computation and get a warning whenever an underflow occurs.
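The asymmetry just described can be seen with Python doubles, where overflow shows up loudly as `inf` while underflow silently produces zero (our illustration; how warnings are raised depends on the language and system):

```python
import math

big = 1e308
print(big * 10)     # inf: overflow is clearly visible in the result

tiny = 1e-320
print(tiny / 1e10)  # 0.0: silent underflow -- the value quietly becomes zero
```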
Remark 2.1.5. We note the following. If for a real number x, $d_{n+1} \ge \frac{\beta}{2}$ and $d_n = \beta - 1$, then in n-digit rounding $d_n + 1$ in fl(x) would be β. Therefore, we set $d_n = 0$ and add 1 to $d_{n-1}$. Again, if $d_{n-1} = \beta - 1$, we set $d_{n-1} = 0$ and add 1 to $d_{n-2}$. We continue in this manner, if required, until all the digits are exhausted. It may happen that $d_1 = d_2 = \cdots = d_n = \beta - 1$ and $d_{n+1} \ge \frac{\beta}{2}$; then for fl(x) we set $d_1 = 1$, all other $d_i = 0$ for $i = 2, 3, \ldots, n$, and increase the exponent by 1, if
$$x = (-1)^s \times \big(.\,\underbrace{(\beta-1)(\beta-1)\ldots(\beta-1)}_{n\text{ times}}\, d_{n+1}\ldots\big)_\beta \times \beta^e, \qquad d_{n+1} \ge \frac{\beta}{2}.$$
For example, consider $\pi = .314159265\cdots \times 10^1$. In 5-digit rounding, since $d_6 = 9 \ge 5$, we set $d_5 = 5 + 1 = 6$ in fl(π), so that $\mathrm{fl}(\pi) = .31416 \times 10^1$.
Remark 2.1.6. Most modern processors, including Intel's, use the IEEE 754 standard format. In the 64-bit binary representation, this format uses 52 bits for the mantissa, 11 bits for the exponent, and 1 bit for the sign. This representation is called double precision.
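The double-precision parameters stated above can be inspected from Python, whose floats are IEEE 754 doubles, via the standard `sys.float_info` record (an illustration we add here, not part of the notes):

```python
import sys

print(sys.float_info.mant_dig)  # 53: significant bits (52 stored + 1 implicit leading bit)
print(sys.float_info.max_exp)   # 1024: exponent upper bound
print(sys.float_info.min_exp)   # -1021: exponent lower bound
print(sys.float_info.epsilon)   # 2**-52
```

Note that the 52 stored mantissa bits give 53 significant bits, because the leading bit of a normalized binary mantissa is always 1 and need not be stored.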
When we perform a computation without any floating-point approximation, we say that
the computation is done using infinite precision (also called exact arithmetic).
Let ⊙ denote any one of the basic arithmetic operations +, −, ×, ÷, and let x, y ∈ ℝ be two real numbers. The process of computing x ⊙ y using n-digit rounding is as follows.
• Step 1: First, we consider the n-digit rounding approximations fl(x) and fl(y) of the numbers x and y, respectively.
• Step 2: Then, we perform the calculation fl(x) ⊙ fl(y) using exact arithmetic.
• Step 3: Finally, we take the n-digit rounding approximation fl(fl(x) ⊙ fl(y)) of the result.
The result from Step 3 is the value of x ⊙ y using n-digit rounding.
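The steps above can be sketched in Python for the decimal case β = 10 (a minimal sketch with function names of our choosing; it simulates n-digit rounding on top of ordinary doubles, so results carry a tiny extra double-precision error):

```python
import math

def fl(x, n):
    """n-digit decimal rounding approximation of x."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1   # exponent so the mantissa lies in [0.1, 1)
    return round(x / 10**e, n) * 10**e

def rounded_op(x, y, operation, n):
    """x (.) y using n-digit rounding: round the inputs, operate exactly, round the result."""
    return fl(operation(fl(x, n), fl(y, n)), n)

# Subtraction of two nearby square roots with n = 6:
print(rounded_op(math.sqrt(100001), math.sqrt(100000), lambda a, b: a - b, 6))  # 0.001
```

The final call reproduces the value .1 × 10⁻² obtained in the worked example that follows.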
We see that
$$\sqrt{100001} = 316.229347\cdots = .316229347\cdots \times 10^3,$$
and
$$\sqrt{100000} = 316.227766\cdots = .316227766\cdots \times 10^3.$$
Then,
$$\mathrm{fl}(\sqrt{100001}) - \mathrm{fl}(\sqrt{100000}) = .000001 \times 10^3 = .1 \times 10^{-2}.$$
Hence,
$$\mathrm{fl}\big(\mathrm{fl}(\sqrt{100001}) - \mathrm{fl}(\sqrt{100000})\big) = .1 \times 10^{-2}.$$
Finally, we have
$$\mathrm{fl}(f(100000)) = .1 \times 10^3 = 100.$$
(iii). The relative error is a measure of the error in relation to the size of the true value, given by
$$\text{Relative Error} = \frac{\text{Error}}{\text{True Value}}, \quad \text{provided True Value} \ne 0.$$
(iv). Absolute value of the relative error is called the absolute relative error.
Remark 2.2.1. Let xA denote the approximation to the real number x ∈ R. We use the
following notations.
The absolute error has to be interpreted carefully, because a relatively small difference between two large numbers can appear to be large, and a relatively large difference between two small numbers can appear to be small. On the other hand, the relative error expresses the difference as a proportion of the true value, which is usually more meaningful, as illustrated below.
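The point is easy to see numerically (toy numbers of our own choosing, not the notes' example):

```python
x_true, x_approx = 100000.0, 100010.0   # large numbers, absolute error 10
y_true, y_approx = 0.0001, 0.00011      # small numbers, absolute error 0.00001

print(abs(x_true - x_approx))                 # 10.0 -- looks big...
print(abs(x_true - x_approx) / abs(x_true))   # 0.0001 -- but only 0.01% off
print(abs(y_true - y_approx))                 # ~1e-05 -- looks tiny...
print(abs(y_true - y_approx) / abs(y_true))   # ~0.1 -- but 10% off
```

The approximation with the larger absolute error is in fact a thousand times more accurate in the relative sense.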
Example 2.2.1.
• Machine Errors.
Hence,
$$E_a(\mathrm{fl}(x)) = |x - \mathrm{fl}(x)| = \left| (-1)^s \beta^e \left( \sum_{i=1}^{\infty} \frac{d_i}{\beta^i} - \sum_{i=1}^{n} \frac{d_i}{\beta^i} \right) \right| = \beta^e \sum_{i=n+1}^{\infty} \frac{d_i}{\beta^i} \le \beta^e \sum_{i=n+1}^{\infty} \frac{\beta - 1}{\beta^i} = \beta^e (\beta - 1) \sum_{i=n+1}^{\infty} \frac{1}{\beta^i} = \beta^e (\beta - 1)\, \frac{\frac{1}{\beta^{n+1}}}{1 - \frac{1}{\beta}} = \frac{\beta^e}{\beta^n} = \beta^{e-n}.$$
Since $|x| \ge \beta^{e-1}$, it follows that
$$E_{ar}(\mathrm{fl}(x)) = \frac{|x - \mathrm{fl}(x)|}{|x|} \le \frac{\beta^{e-n}}{\beta^{e-1}} = \beta^{-n+1}.$$
Case of rounding: Sub-case $0 \le d_{n+1} < \frac{\beta}{2}$. We note that
$$\mathrm{fl}(x) = (-1)^s \times (.d_1 d_2 \ldots d_n)_\beta \times \beta^e = (-1)^s \beta^e \sum_{i=1}^{n} \frac{d_i}{\beta^i}.$$
Hence,
$$E_a(\mathrm{fl}(x)) = |x - \mathrm{fl}(x)| = \left| (-1)^s \beta^e \left( \sum_{i=1}^{\infty} \frac{d_i}{\beta^i} - \sum_{i=1}^{n} \frac{d_i}{\beta^i} \right) \right| = \beta^e \sum_{i=n+1}^{\infty} \frac{d_i}{\beta^i} = \beta^e \left( \frac{d_{n+1}}{\beta^{n+1}} + \sum_{i=n+2}^{\infty} \frac{d_i}{\beta^i} \right)$$
$$\le \beta^e \left( \frac{\frac{\beta}{2} - 1}{\beta^{n+1}} + (\beta - 1) \sum_{i=n+2}^{\infty} \frac{1}{\beta^i} \right) \qquad \left( \because \; 0 \le d_{n+1} < \frac{\beta}{2} \implies 0 \le d_{n+1} \le \frac{\beta}{2} - 1, \text{ and } d_i \le \beta - 1 \right)$$
$$= \beta^e \left( \frac{\beta - 2}{2\beta^{n+1}} + (\beta - 1)\, \frac{\frac{1}{\beta^{n+2}}}{1 - \frac{1}{\beta}} \right) = \beta^e \left( \frac{\beta - 2}{2\beta^{n+1}} + \frac{1}{\beta^{n+1}} \right) = \beta^e\, \frac{\beta - 2 + 2}{2\beta^{n+1}} = \beta^e\, \frac{\beta}{2\beta^{n+1}} = \frac{1}{2}\beta^{e-n}.$$
Case of rounding: Sub-case $\frac{\beta}{2} \le d_{n+1} < \beta$. We note that
$$\mathrm{fl}(x) = (-1)^s \times (.d_1 d_2 \ldots (d_n + 1))_\beta \times \beta^e = (-1)^s \beta^e \left( \sum_{i=1}^{n} \frac{d_i}{\beta^i} + \frac{1}{\beta^n} \right).$$
Hence,
$$E_a(\mathrm{fl}(x)) = |x - \mathrm{fl}(x)| = \left| (-1)^s \beta^e \left( \sum_{i=1}^{\infty} \frac{d_i}{\beta^i} - \sum_{i=1}^{n} \frac{d_i}{\beta^i} - \frac{1}{\beta^n} \right) \right| = \beta^e \left| \sum_{i=n+1}^{\infty} \frac{d_i}{\beta^i} - \frac{1}{\beta^n} \right|$$
$$= \beta^e \left( \frac{1}{\beta^n} - \sum_{i=n+1}^{\infty} \frac{d_i}{\beta^i} \right) \qquad \left( \because \; \sum_{i=n+1}^{\infty} \frac{d_i}{\beta^i} \le (\beta - 1) \sum_{i=n+1}^{\infty} \frac{1}{\beta^i} = (\beta - 1)\, \frac{\frac{1}{\beta^{n+1}}}{1 - \frac{1}{\beta}} = \frac{1}{\beta^n} \right)$$
$$\le \beta^e \left( \frac{1}{\beta^n} - \frac{d_{n+1}}{\beta^{n+1}} \right) \le \beta^e \left( \frac{1}{\beta^n} - \frac{\beta}{2\beta^{n+1}} \right) \qquad \left( \because \; \frac{\beta}{2} \le d_{n+1} \right)$$
$$= \frac{1}{2}\beta^{e-n}.$$
We now introduce two measures that give a fairly precise idea of the possible accuracy in
a floating-point approximation.
Definition 2.4.1 (Unit round-off (u)). The unit round-off of a computer is denoted by u and is defined as
$$u = \sup \big\{ E_{ar}(\mathrm{fl}(x)) : x \in \mathbb{R}, \; \mathrm{UFL} \le |x| \le \mathrm{OFL} \big\},$$
i.e., it is the maximum possible value of $E_{ar}(\mathrm{fl}(x))$ for any x satisfying the condition as in Theorem 2.4.1. By Theorem 2.4.1, we note that
$$u = \begin{cases} \beta^{-n+1}, & \text{in case of chopping}, \\ \frac{1}{2}\beta^{-n+1}, & \text{in case of rounding}. \end{cases}$$
Theorem 2.4.2.
$$\varepsilon_{mach} = \begin{cases} \beta^{-n+1}, & \text{in case of chopping}, \\ \frac{1}{2}\beta^{-n+1}, & \text{in case of rounding}. \end{cases}$$
Proof. See Atkinson, An Introduction to Numerical Analysis, 2nd ed., p. 15.
Remark 2.4.1. From the definition of $\varepsilon_{mach}$, we note that for any δ ∈ F with $0 < \delta < \varepsilon_{mach}$, we have fl(1 + δ) = 1, i.e., in the computer the number 1 + δ is identical to 1.
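This behaviour can be probed experimentally by halving a candidate δ until 1 + δ rounds to 1 (a standard trick, sketched here in Python for doubles, where β = 2 and n = 53). Note that Python reports the epsilon 2⁻⁵², the gap between 1 and the next double; in the rounding convention of Theorem 2.4.2, εmach = ½ · 2¹⁻⁵³ = 2⁻⁵³, so the two conventions differ by a factor of 2:

```python
import sys

eps = 1.0
while 1.0 + eps / 2 > 1.0:   # stop once 1 + eps/2 is indistinguishable from 1
    eps /= 2

print(eps)                             # 2**-52, the gap between 1.0 and the next double
print(eps == sys.float_info.epsilon)   # True
print(1.0 + 2**-53 == 1.0)             # True: delta below 2**-52 vanishes in 1 + delta
```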
Remark 2.4.2. We note that $\varepsilon_{mach} \ne \mathrm{UFL}$. $\varepsilon_{mach}$ is determined by the precision, i.e., the number of digits in the mantissa field of the floating-point system, whereas UFL is determined by the lower bound of the exponent field.
Example 2.4.1.
Maximal Accuracy
Remark 2.4.3. From the definition of maximal accuracy, we note that fl(Macc + 1) ≠ Macc + 1.
Theorem 2.4.3.
$$\mathrm{Macc} = \beta^n.$$
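For IEEE doubles (β = 2, n = 53), this gives Macc = 2⁵³; past that point consecutive integers are no longer representable. A quick check we add for illustration:

```python
m = 2.0 ** 53            # Macc for doubles: 9007199254740992.0

print((m - 1.0) + 1.0 == m)   # True: integers up to 2**53 are represented exactly
print(m + 1.0 == m)           # True: 2**53 + 1 rounds back to 2**53
```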
Definition 2.4.4 (Significant digits). Let us consider a floating-point number system with base β ∈ ℕ, β ⩾ 2. For x ∈ ℝ, let $x_A$ be an approximation of x. Let
$$s = \sup \{ k \in \mathbb{Z} : \beta^k \le |x| \} \quad \text{and} \quad r = \sup \left\{ t \in \mathbb{N} : |x - x_A| \le \frac{1}{2}\beta^{s+1-t} \right\}.$$
Then r is called the number of significant digits in $x_A$.
$$|x - x_A| = .002 = .2 \times 10^{-2} = \frac{1}{2} \times .4 \times 10^{-2} < \frac{1}{2} \times 10^{-2} = \frac{1}{2} \times 10^{1+1-4},$$
which corresponds to s = 1 and r = 4,
and
$$|x - x_A| = .00006 = .6 \times 10^{-4} = \frac{1}{2} \times 1.2 \times 10^{-4} = \frac{1}{2} \times .12 \times 10^{-3} < \frac{1}{2} \times 10^{-3} = \frac{1}{2} \times 10^{-2+1-2},$$
which corresponds to s = −2 and r = 2.
$$E_{ar}(x_A) = \frac{|x - x_A|}{|x|} \le \frac{\frac{1}{2}\beta^{s-r+1}}{\beta^s} = \frac{1}{2}\beta^{1-r}.$$
(ii). The number of significant digits roughly measures the number of leading non-zero digits of $x_A$ that are correct relative to the corresponding digits in the true value x. However, this is not a precise way to determine the number of significant digits, as we have seen in the above examples.
(iii). The role of significant digits in numerical calculations is very important in the sense
that the loss of significant digits may result in drastic amplification of the relative
error as illustrated in the following example.
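A quick numerical illustration of this amplification (our own example, separate from the one below): the quantity (1 − cos x)/x² tends to 1/2 as x → 0, but the direct formula subtracts two nearly equal numbers and cancels catastrophically for small x, while the algebraically equivalent half-angle form 2 sin²(x/2)/x² does not:

```python
import math

x = 1.0e-8
naive = (1.0 - math.cos(x)) / x**2                    # cos(x) rounds to exactly 1.0: all digits lost
half_angle = 0.5 * (math.sin(x / 2) / (x / 2)) ** 2   # same quantity, no cancellation

print(naive)        # 0.0 -- the true value ~0.5 has vanished entirely
print(half_angle)   # ~0.5
```

A relative error of about 10⁻¹⁶ in cos(x) becomes a 100% relative error in the final result.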
Example 2.4.3 (Loss of Significance). (i). Let us consider two real numbers
Let