Computational Methods
(ii) Explain the use of the mantissa and exponent for its representation on a computer. Why is it
necessary?
In order to represent very large numbers within a limited number of bits, it is necessary to split the representation into a mantissa and an exponent. E.g. it is impossible to represent 15000000₁₀ = 111001001110000111000000₂ directly in a 16-bit form (1 sign bit, 5-bit signed exponent, 10-bit mantissa), since the magnitude alone needs 24 bits. But with the help of a mantissa and an exponent, it can be represented in 16 bits as
15000000₁₀ = (15 × 10⁶)₁₀, i.e. mantissa 15₁₀ = 1111₂ and exponent 6₁₀ = 110₂:
Sign: 0 | Exponent: 00110 | Mantissa: 0000001111
(iii) Explain the terms “normalization”, “hidden bit” and “bias” in the context of floating point
number representation.
Normalisation
It is the process of removing the leading zeros after the binary point and correspondingly decreasing the exponent so that the value of the number is unchanged. This frees space to store additional significant digits. As a result, the most significant bit of the mantissa becomes non-zero, e.g.
0.00011101 × 2⁰ is normalised to 0.11101 × 2⁻³
Hidden bit
The first bit of a normalised floating point number is always 1. This bit need not be stored in memory; it is implicitly present in the mantissa. This is called the hidden bit.
Bias
To allow the exponent to be stored as an unsigned quantity, a constant is added to it in the actual bit representation of the floating point number. This constant is called the bias.
E.g.:
❖ If 4 bits are allocated for the exponent, the stored value can range from 0₁₀ = 0000₂ to 15₁₀ = 1111₂ (unbiased).
❖ With a bias of 7₁₀, the representable exponent range becomes −7₁₀ to 8₁₀ (biased): an actual exponent of −7 is stored as 0₁₀ = 0000₂, and so on, thus avoiding the need for a sign bit in the exponent.
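As an illustration (a Python sketch, not part of the original answer; the helper names are my own), the 4-bit biased exponent described above can be encoded and decoded like this:

```python
BIAS = 7  # bias for a 4-bit exponent field, as in the example above

def encode_exponent(e):
    """Store a signed exponent e in the range -7..8 as an unsigned 4-bit value."""
    stored = e + BIAS
    assert 0 <= stored <= 15, "exponent out of range"
    return stored

def decode_exponent(stored):
    """Recover the signed exponent from the stored unsigned value."""
    return stored - BIAS

print(encode_exponent(-7))      # -> 0, stored as 0000
print(decode_exponent(0b1111))  # -> 8
```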
2. Please refer to the answer for the 2nd question at the end of this document.
3. The sine function can be computed as the sum of an infinite power series
sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + …
Compute the value of sin(x) for these three x values: 0.5, 5, 50. Note that this is an infinite series
and you can only terminate the computation at a term that is very small (i.e. below a certain
tolerance which you can set). Discuss your observations in doing these computations. Do you see
any problem with the direct evaluation of this series on a computer? Why? Suggest any remedy
you may have in overcoming the problem(s).
You can evaluate the series by writing a MATLAB or Python program if you wish. But because
both languages can do their computation to a very high precision, the manifestation of the problem
might not be obvious.
How about the evaluation of the binomial expansion below? Any possible problem?
(1+x)ⁿ = 1 + nx + n(n−1)x²/2! + n(n−1)(n−2)x³/3! + …
SINE FUNCTION:
Program:
format long; % set the output display to long fixed-decimal format
x = 0.5;
Analytical = sin(x);
v = 1e-100; % tolerance limit
Iteration = 0; i = 0; t = 0;
while abs(t) >= v || t == 0
    t = ((-1)^i) * (x^((2*i)+1)) / factorial((2*i)+1); % i-th series term
    Iteration = Iteration + t;
    i = i + 1;
end
fprintf('sin(%g) evaluation\nTerminating when a term falls below %g:\nNumerical methods solution: %.8g\nAnalytical solution: %.8g\n\n', x, v, Iteration, Analytical)
Observations:
I set the computation to terminate when a term of the infinite series falls below 10⁻¹⁰⁰ (the tolerance limit).
❖ For sin(0.5) and sin(5), the analytical and infinite-series solutions agree up to 7 significant digits. But for sin(50), the program returns the indeterminate form NaN.
❖ I observed that at the 92nd term of the infinite series,
• the numerator x^(2i+1) overflows to infinity,
• the denominator factorial((2i)+1) also overflows to infinity,
• so the term evaluates to Inf/Inf, which is NaN.
Hence sin(50) becomes indeterminate in MATLAB.
Remedies:
❖ Since the largest finite floating-point number MATLAB can represent is 1.797693134862316 × 10³⁰⁸ (realmax), infinity is reached within 92 iterations. Software that supports a wider exponent range (or arbitrary precision) would be needed for direct evaluation.
❖ If the tolerance is relaxed to 10⁻¹⁷, the loop terminates before overflow and returns a finite answer for sin(50), but it is still highly inaccurate because the huge alternating terms cancel each other.
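A further remedy (my own suggestion, sketched in Python rather than MATLAB, not part of the original submission): compute each term from the previous one by a recurrence, so that x^n and n! are never formed separately, and first reduce large arguments using the periodicity of sine:

```python
import math

def sin_series(x, tol=1e-17):
    """Sum the sine series without forming x**n or factorial(n) explicitly."""
    x = math.fmod(x, 2 * math.pi)  # range reduction: sin has period 2*pi
    t = x                          # first term: x / 1!
    s = t
    k = 0
    while abs(t) >= tol:
        # t_{k+1} = -t_k * x^2 / ((2k+2)(2k+3))
        t *= -x * x / ((2 * k + 2) * (2 * k + 3))
        s += t
        k += 1
    return s

for x in (0.5, 5.0, 50.0):
    print(x, sin_series(x), math.sin(x))
```

Without the range reduction, the recurrence alone still avoids Inf/Inf, but for x = 50 the intermediate terms grow to roughly 10²⁰, so cancellation would destroy most of the significant digits anyway.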
BINOMIAL EXPANSION:
Program:
x = 50;
n = 0.5;
Analytical = (1+x)^n;
v = 1e-20; % tolerance limit
Iteration = 0; z = 0; i = 0;
while abs(z) >= v || z == 0
    % i-th term of the series: C(n,i) * x^i, with C(n,i) written via gamma
    z = (gamma(n+1) * (x^i)) / (gamma(i+1) * gamma((n-i)+1));
    Iteration = Iteration + z;
    i = i + 1;
end
fprintf('Analytical value: %g\nBinomial solution: %g\n', Analytical, Iteration)
Observation:
The program gives nearly accurate answers for small values of x such as 0.5, but for larger x it deviates badly. This is expected: for a non-integer power n the binomial series converges only for |x| < 1, so for x = 50 the terms grow without bound and the partial sums diverge.
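A minimal Python sketch (my own construction, using the coefficient recurrence C(n,k+1) = C(n,k)·(n−k)/(k+1)) makes the divergence visible:

```python
def binom_series(x, n=0.5, terms=50):
    """Partial sum of (1+x)^n = sum_k C(n,k) x^k.
    For non-integer n, the series converges only when |x| < 1."""
    s = 0.0
    c = 1.0  # C(n,0); recurrence: C(n,k+1) = C(n,k) * (n-k) / (k+1)
    for k in range(terms):
        s += c * x ** k
        c *= (n - k) / (k + 1)
    return s

print(binom_series(0.5))   # converges towards (1.5)**0.5
print(binom_series(50.0))  # partial sums blow up: the series diverges
```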
4. In the lecture notes, I demonstrated that the difference between 8.200 and 8.125 is very different
from the difference between 0.200 and 0.125 using an 8-bit mantissa, even though they should be
the same. Explain the cause of this difference.
❖ The number of bits needed to store the whole-number part determines how many bits remain for the fraction part. The fraction is therefore truncated to the available bits.
❖ Using an 8-bit mantissa, 8.200 − 8.125:
8.200 − 8.125 = (8.200₁₀ ≈ 1000.0011₂) − (8.125₁₀ = 1000.0010₂)
= 8.1875₁₀ − 8.125₁₀
Difference in 8-bit mantissa = 0.0001₂ = 0.0625₁₀
Actual difference = 0.075₁₀
❖ Using an 8-bit mantissa, 0.200 − 0.125:
0.200 − 0.125 = (0.200₁₀ ≈ 0.00110011₂) − (0.125₁₀ = 0.00100000₂)
= 0.19921875₁₀ − 0.12500000₁₀
Difference in 8-bit mantissa = 0.00010011₂ = 0.07421875₁₀
Actual difference = 0.075₁₀
❖ Explanation for the cause of the above difference:
The bits needed for the whole-number part (4 bits for the '8') reduce the bits left for the fraction (only 4 of the 8 mantissa bits for 0.200 and 0.125). With so little space, only a few fractional digits of 8.200 are stored, and it is this stored approximation that enters the subtraction. So the computed difference between 8.200 and 8.125 deviates strongly from the true answer. For 0.200 and 0.125, however, all 8 mantissa bits can be used for the fraction, so the computed difference between 0.200 and 0.125 is much more accurate.
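The effect can be reproduced with a short Python sketch (my own construction, truncating values to a fixed number of fractional bits as in the example above):

```python
def to_fixed(v, frac_bits):
    """Keep only frac_bits binary digits after the point (truncation)."""
    scale = 2 ** frac_bits
    return int(v * scale) / scale

# 8.200 and 8.125 need 4 bits for the integer part, leaving only 4 fraction bits:
a = to_fixed(8.200, 4) - to_fixed(8.125, 4)
# 0.200 and 0.125 can use all 8 mantissa bits for the fraction:
b = to_fixed(0.200, 8) - to_fixed(0.125, 8)
print(a)  # -> 0.0625      (true difference: 0.075)
print(b)  # -> 0.07421875  (true difference: 0.075)
```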
5. Write a program that implements the Insertion Sort algorithm. Test your implementation with
this set of inputs: 16.2 32.7 -0.5 4.4 21.8 -17.0 6.9 14.1 1.5. Print the results.
Written Program:
x = input('\nEnter the numbers to be sorted in the format [number1 number2 ...]\n');
L = length(x);
for i = 2:L
    j = i;
    while j > 1
        if x(j) < x(j-1)
            t = x(j-1);    % swap adjacent out-of-order elements
            x(j-1) = x(j);
            x(j) = t;
        end
        j = j - 1;
    end
end
disp('Sorted result is');
disp(x)
Result:
-17.0 -0.5 1.5 4.4 6.9 14.1 16.2 21.8 32.7
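The same algorithm in Python (a sketch equivalent to the MATLAB program above, written in the standard shift-and-insert form):

```python
def insertion_sort(a):
    """Sort the list a in place using insertion sort."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]  # shift larger elements one slot right
            j -= 1
        a[j + 1] = key       # insert key into its sorted position
    return a

data = [16.2, 32.7, -0.5, 4.4, 21.8, -17.0, 6.9, 14.1, 1.5]
print(insertion_sort(data))
# -> [-17.0, -0.5, 1.5, 4.4, 6.9, 14.1, 16.2, 21.8, 32.7]
```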
2. Write a program that converts a given floating point binary number with a 24-bit normalized
mantissa and an 8-bit exponent to its decimal (i.e. base 10) equivalent. For the mantissa, use the
representation that has a hidden bit, and for the exponent use a bias of 127 instead of a sign bit.
Of course, you need to take care of negative numbers in the mantissa also. Use your program to
answer the following questions:
(a) Mantissa: 11110010 11000101 01101010, exponent: 01011000. What is the base-10 number?
(b) What is the largest number (in base 10) the system can represent?
(c) What is the smallest non-zero positive base-10 number the system can represent?
(d) What is the smallest difference between two such numbers? Give your answer in base 10.
(e) How many significant base-10 digits can we trust using such a representation?
Your program should do the computations from first principle, and not use functions already
available in MATLAB or Python, if you are using either language.
Program:
x = input('Enter floating point binary number\n', 's');
if strlength(x) == 32
    sign = str2num(x(1));
    e = 0; m = 0;
    for i = 2:1:9
        e(i-1) = str2num(x(i));   % 8 exponent bits
    end
    for j = 10:1:32
        m(j-9) = str2num(x(j));   % 23 stored mantissa bits
    end
    ec = 0; mc = 0;
    for i = 1:1:8
        if e(i) == 1
            ec = ec + 2^(8-i);    % exponent as an unsigned integer
        end
    end
    e = ec - 127;                 % remove the bias
    for j = 1:1:23
        if m(j) == 1
            mc = mc + 2^(-j);     % fractional part of the mantissa
        end
    end
    d = ((-1)^sign) * (1 + mc) * (2^e);   % 1 + mc restores the hidden bit
    fprintf('\nThe decimal equivalent representation of %s is : %.7g\n', x, d);
else
    fprintf('\nEnter valid floating point binary number\n');
end
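For cross-checking, here is an equivalent first-principles decoder in Python (my own sketch, not part of the submitted MATLAB program):

```python
def decode(bits):
    """Decode a 32-bit string: 1 sign bit, 8-bit exponent biased by 127,
    23 stored mantissa bits with a hidden leading 1."""
    assert len(bits) == 32 and set(bits) <= {"0", "1"}
    sign = -1 if bits[0] == "1" else 1
    e = sum(2 ** (7 - i) for i, b in enumerate(bits[1:9]) if b == "1") - 127
    m = 1 + sum(2 ** -(i + 1) for i, b in enumerate(bits[9:]) if b == "1")
    return sign * m * 2 ** e

# Part (a): the first of the 24 mantissa bits is the sign, the rest are stored
print(decode("1" + "01011000" + "11100101100010101101010"))
# -> about -3.449986e-12
```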
(a) Mantissa: 11110010 11000101 01101010, exponent: 01011000. What is the base-10 number?
−3.449986 × 10⁻¹²
(b) What is the largest number (in base 10) the system can represent?
6.805647 × 10³⁸
(c) What is the smallest non-zero positive base-10 number the system can represent?
1.175494 × 10⁻³⁸
(d) What is the smallest difference between two such numbers? Give your answer in base 10.
The smallest difference is 1.401298464324817 × 10⁻⁴⁵
I found the decimal equivalents of the two smallest representable numbers and then computed their difference in MATLAB. Although both numbers agree to seven significant digits when printed, their difference is 1.401298464324817 × 10⁻⁴⁵.
(e) How many significant base-10 digits can we trust using such a representation?
Seven